I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:
values = line.Split(delimiter);
where line is the a string that holds the values that are seperated by the delimiter.
Measuring the performance of my ReadNextRow method I noticed that it spends 66% on String.Split, so I was wondering if someone knows of a faster method to do this.
Thanks!
The BCL implementation of string.Split is actually quite fast, I've done some testing here trying to out preform it and it's not easy.
But there's one thing you can do and that's to implement this as a generator:
public static IEnumerable<string> GetSplit( this string s, char c )
{
int l = s.Length;
int i = 0, j = s.IndexOf( c, 0, l );
if ( j == -1 ) // No such substring
{
yield return s; // Return original and break
yield break;
}
while ( j != -1 )
{
if ( j - i > 0 ) // Non empty?
{
yield return s.Substring( i, j - i ); // Return non-empty match
}
i = j + 1;
j = s.IndexOf( c, i, l - i );
}
if ( i < l ) // Has remainder?
{
yield return s.Substring( i, l - i ); // Return remaining trail
}
}
The above method is not necessarily faster than string.Split for small strings but it returns results as it finds them, this is the power of lazy evaluation. If you have long lines or need to conserve memory, this is the way to go.
The above method is bounded by the performance of IndexOf and Substring which does too much index of out range checking and to be faster you need to optimize away these and implement your own helper methods. You can beat the string.Split performance but it's gonna take cleaver int-hacking. You can read my post about that here.
It should be pointed out that split() is a questionable approach for parsing CSV files in case you come across commas in the file eg:
1,"Something, with a comma",2,3
The other thing I'll point out without knowing how you profiled is be careful about profiling this kind of low level detail. The granularity of the Windows/PC timer might come into play and you may have a significant overhead in just looping so use some sort of control value.
That being said, split() is built to handle regular expressions, which are obviously more complex than you need (and the wrong tool to deal with escaped commas anyway). Also, split() creates lots of temporary objects.
So if you want to speed it up (and I have trouble believing that performance of this part is really an issue) then you want to do it by hand and you want to reuse your buffer objects so you're not constantly creating objects and giving the garbage collector work to do in cleaning them up.
The algorithm for that is relatively simple:
Stop at every comma;
When you hit quotes continue until you hit the next set of quotes;
Handle escaped quotes (ie \") and arguably escaped commas (\,).
Oh and to give you some idea of the cost of regex, there was a question (Java not C# but the principle was the same) where someone wanted to replace every n-th character with a string. I suggested using replaceAll() on String. Jon Skeet manually coded the loop. Out of curiosity I compared the two versions and his was an order of magnitude better.
So if you really want performance, it's time to hand parse.
Or, better yet, use someone else's optimized solution like this fast CSV reader.
By the way, while this is in relation to Java it concerns the performance of regular expressions in general (which is universal) and replaceAll() vs a hand-coded loop: Putting char into a java string for each N characters.
Here's a very basic example using ReadOnlySpan. On my machine this takes around 150ns as opposed to string.Split() which takes around 250ns. That's a nice 40% improvement right there.
string serialized = "1577836800;1000;1";
ReadOnlySpan<char> span = serialized.AsSpan();
Trade result = new Trade();
index = span.IndexOf(';');
result.UnixTimestamp = long.Parse(span.Slice(0, index));
span = span.Slice(index + 1);
index = span.IndexOf(';');
result.Price = float.Parse(span.Slice(0, index));
span = span.Slice(index + 1);
index = span.IndexOf(';');
result.Quantity = float.Parse(span.Slice(0, index));
return result;
Note that a ReadOnlySpan.Split() will soon be part of the framework. See
https://github.com/dotnet/runtime/pull/295
Depending on use, you can speed this up by using Pattern.split instead of String.split. If you have this code in a loop (which I assume you probably do since it sounds like you are parsing lines from a file) String.split(String regex) will call Pattern.compile on your regex string every time that statement of the loop executes. To optimize this, Pattern.compile the pattern once outside the loop and then use Pattern.split, passing the line you want to split, inside the loop.
Hope this helps
I found this implementation which is 30% faster from Dejan Pelzel's blog. I qoute from there:
The Solution
With this in mind, I set to create a string splitter that would use an internal buffer similarly to a StringBuilder. It uses very simple logic of going through the string and saving the value parts into the buffer as it goes along.
public int Split(string value, char separator)
{
int resultIndex = 0;
int startIndex = 0;
// Find the mid-parts
for (int i = 0; i < value.Length; i++)
{
if (value[i] == separator)
{
this.buffer[resultIndex] = value.Substring(startIndex, i - startIndex);
resultIndex++;
startIndex = i + 1;
}
}
// Find the last part
this.buffer[resultIndex] = value.Substring(startIndex, value.Length - startIndex);
resultIndex++;
return resultIndex;
How To Use
The StringSplitter class is incredibly simple to use as you can see in the example below. Just be careful to reuse the StringSplitter object and not create a new instance of it in loops or for a single time use. In this case it would be better to juse use the built in String.Split.
var splitter = new StringSplitter(2);
splitter.Split("Hello World", ' ');
if (splitter.Results[0] == "Hello" && splitter.Results[1] == "World")
{
Console.WriteLine("It works!");
}
The Split methods returns the number of items found, so you can easily iterate through the results like this:
var splitter = new StringSplitter(2);
var len = splitter.Split("Hello World", ' ');
for (int i = 0; i < len; i++)
{
Console.WriteLine(splitter.Results[i]);
}
This approach has advantages and disadvantages.
You might think that there are optimizations to be had, but the reality will be you'll pay for them elsewhere.
You could, for example, do the split 'yourself' and walk through all the characters and process each column as you encounter it, but you'd be copying all the parts of the string in the long run anyhow.
One of the optimizations we could do in C or C++, for example, is replace all the delimiters with '\0' characters, and keep pointers to the start of the column. Then, we wouldn't have to copy all of the string data just to get to a part of it. But this you can't do in C#, nor would you want to.
If there is a big difference between the number of columns that are in the source, and the number of columns that you need, walking the string manually may yield some benefit. But that benefit would cost you the time to develop it and maintain it.
I've been told that 90% of the CPU time is spent in 10% of the code. There are variations to this "truth". In my opinion, spending 66% of your time in Split is not that bad if processing CSV is the thing that your app needs to do.
Dave
Some very thorough analysis on String.Slit() vs Regex and other methods.
We are talking ms savings over very large strings though.
The main problem(?) with String.Split is that it's general, in that it caters for many needs.
If you know more about your data than Split would, it can make an improvement to make your own.
For instance, if:
You don't care about empty strings, so you don't need to handle those any special way
You don't need to trim strings, so you don't need to do anything with or around those
You don't need to check for quoted commas or quotes
You don't need to handle quotes at all
If any of these are true, you might see an improvement by writing your own more specific version of String.Split.
Having said that, the first question you should ask is whether this actually is a problem worth solving. Is the time taken to read and import the file so long that you actually feel this is a good use of your time? If not, then I would leave it alone.
The second question is why String.Split is using that much time compared to the rest of your code. If the answer is that the code is doing very little with the data, then I would probably not bother.
However, if, say, you're stuffing the data into a database, then 66% of the time of your code spent in String.Split constitutes a big big problem.
CSV parsing is actually fiendishly complex to get right, I used classes based on wrapping the ODBC Text driver the one and only time I had to do this.
The ODBC solution recommended above looks at first glance to be basically the same approach.
I thoroughly recommend you do some research on CSV parsing before you get too far down a path that nearly-but-not-quite works (all too common). The Excel thing of only double-quoting strings that need it is one of the trickiest to deal with in my experience.
As others have said, String.Split() will not always work well with CSV files. Consider a file that looks like this:
"First Name","Last Name","Address","Town","Postcode"
David,O'Leary,"12 Acacia Avenue",London,NW5 3DF
June,Robinson,"14, Abbey Court","Putney",SW6 4FG
Greg,Hampton,"",,
Stephen,James,"""Dunroamin"" 45 Bridge Street",Bristol,BS2 6TG
(e.g. inconsistent use of speechmarks, strings including commas and speechmarks, etc)
This CSV reading framework will deal with all of that, and is also very efficient:
LumenWorks.Framework.IO.Csv by Sebastien Lorien
This is my solution:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
Dim kwds(1) As String
Dim k = 0
Dim tmp As String = ""
For l = 1 To inputString.Length - 1
tmp = Mid(inputString, l, 1)
If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
kwds(k) &= tmp
Next
Return kwds
End Function
Here is a version with benchmarking:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
Dim sw As New Stopwatch
sw.Start()
Dim kwds(1) As String
Dim k = 0
Dim tmp As String = ""
For l = 1 To inputString.Length - 1
tmp = Mid(inputString, l, 1)
If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
kwds(k) &= tmp
Next
sw.Stop()
Dim fsTime As Long = sw.ElapsedTicks
sw.Start()
Dim strings() As String = inputString.Split(separator)
sw.Stop()
Debug.Print("FastSplit took " + fsTime.ToString + " whereas split took " + sw.ElapsedTicks.ToString)
Return kwds
End Function
Here are some results on relatively small strings but with varying sizes, up to 8kb blocks. (times are in ticks)
FastSplit took 8 whereas split took 10
FastSplit took 214 whereas split took 216
FastSplit took 10 whereas split took 12
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 12
FastSplit took 7 whereas split took 9
FastSplit took 6 whereas split took 8
FastSplit took 5 whereas split took 7
FastSplit took 10 whereas split took 13
FastSplit took 9 whereas split took 232
FastSplit took 7 whereas split took 8
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 215 whereas split took 217
FastSplit took 10 whereas split took 231
FastSplit took 8 whereas split took 10
FastSplit took 8 whereas split took 10
FastSplit took 7 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 1405
FastSplit took 9 whereas split took 11
FastSplit took 8 whereas split took 10
Also, I know someone will discourage my use of ReDim Preserve instead of using a list... The reason is, the list really didn't provide any speed difference in my benchmarks so I went back to the "simple" way.
public static unsafe List<string> SplitString(char separator, string input)
{
List<string> result = new List<string>();
int i = 0;
fixed(char* buffer = input)
{
for (int j = 0; j < input.Length; j++)
{
if (buffer[j] == separator)
{
buffer[i] = (char)0;
result.Add(new String(buffer));
i = 0;
}
else
{
buffer[i] = buffer[j];
i++;
}
}
buffer[i] = (char)0;
result.Add(new String(buffer));
}
return result;
}
You can assume that String.Split will be close to optimal; i.e. it could be quite hard to improve on it. By far the easier solution is to check whether you need to split the string at all. It's quite likely that you'll be using the individual strings directly. If you define a StringShim class (reference to String, begin & end index) you'll be able to split a String into a set of shims instead. These will have a small, fixed size, and will not cause string data copies.
String.split is rather slow, if you want some faster methods, here you go. :)
However CSV is much better parsed by a rule based parser.
This guy, has made a rule based tokenizer for java. (requires some copy and pasting unfortunately)
http://www.csdgn.org/code/rule-tokenizer
private static final String[] fSplit(String src, char delim) {
ArrayList<String> output = new ArrayList<String>();
int index = 0;
int lindex = 0;
while((index = src.indexOf(delim,lindex)) != -1) {
output.add(src.substring(lindex,index));
lindex = index+1;
}
output.add(src.substring(lindex));
return output.toArray(new String[output.size()]);
}
private static final String[] fSplit(String src, String delim) {
ArrayList<String> output = new ArrayList<String>();
int index = 0;
int lindex = 0;
while((index = src.indexOf(delim,lindex)) != -1) {
output.add(src.substring(lindex,index));
lindex = index+delim.length();
}
output.add(src.substring(lindex));
return output.toArray(new String[output.size()]);
}
Related
I have a string like this:
-82.9494547,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0
I need to have output like this:
-82.9494547 36.2913021, -83.0784938 36.2347521, -82.9537782,36.079235
I have tried this following to code to achieve the desired output:
string[] coordinatesVal = coordinateTxt.Trim().Split(new string[] { ",0" }, StringSplitOptions.None);
for (int i = 0; i < coordinatesVal.Length - 1; i++)
{
coordinatesVal[i] = coordinatesVal[i].Trim();
coordinatesVal[i] = coordinatesVal[i].Replace(',', ' ');
numbers.Append(coordinatesVal[i]);
if (i != coordinatesVal.Length - 1)
{
coordinatesVal.Append(", ");
}
}
But this process does not seem to me the professional solution. Can anyone please suggest more efficient way of doing this?
Your code is okay. You could dismiss temporary results and chain method calls
var numbers = new StringBuilder();
string[] coordinatesVal = coordinateTxt
.Trim()
.Split(new string[] { ",0" }, StringSplitOptions.None);
for (int i = 0; i < coordinatesVal.Length - 1; i++) {
numbers
.Append(coordinatesVal[i].Trim().Replace(',', ' '))
.Append(", ");
}
numbers.Length -= 2;
Note that the last statement assumes that there is at least one coordinate pair available. If the coordinates can be empty, you would have to enclose the loop and this last statement in if (coordinatesVal.Length > 0 ) { ... }. This is still more efficient than having an if inside the loop.
You ask about efficiency, but you don't specify whether you mean code efficiency (execution speed) or programmer efficiency (how much time you have to spend on it).
One key part of professional programming is to judge which one of these is more important in any given situation.
The other answers do a good job of covering programmer efficiency, so I'm taking a stab at code efficiency. I'm doing this at home for fun, but for professional work I would need a good reason before putting in the effort to even spend time comparing the speeds of the methods given in the other answers, let alone try to improve on them.
Having said that, waiting around for the program to finish doing the conversion of millions of coordinate pairs would give me such a reason.
One of the speed pitfalls of C# string handling is the way String.Replace() and String.Trim() return a whole new copy of the string. This involves allocating memory, copying the characters, and eventually cleaning up the garbage generated. Do that a few million times, and it starts to add up. With that in mind, I attempted to avoid as many allocations and copies as possible.
enum CurrentField
{
FirstNum,
SecondNum,
UnwantedZero
};
static string ConvertStateMachine(string input)
{
// Pre-allocate enough space in the string builder.
var numbers = new StringBuilder(input.Length);
var state = CurrentField.FirstNum;
int i = 0;
while (i < input.Length)
{
char c = input[i++];
switch (state)
{
// Copying the first number to the output, next will be another number
case CurrentField.FirstNum:
if (c == ',')
{
// Separate the two numbers by space instead of comma, then move on
numbers.Append(' ');
state = CurrentField.SecondNum;
}
else if (!(c == ' ' || c == '\n'))
{
// Ignore whitespace, output anything else
numbers.Append(c);
}
break;
// Copying the second number to the output, next will be the ,0\n that we don't need
case CurrentField.SecondNum:
if (c == ',')
{
numbers.Append(", ");
state = CurrentField.UnwantedZero;
}
else if (!(c == ' ' || c == '\n'))
{
// Ignore whitespace, output anything else
numbers.Append(c);
}
break;
case CurrentField.UnwantedZero:
// Output nothing, just track when the line is finished and we start all over again.
if (c == '\n')
{
state = CurrentField.FirstNum;
}
break;
}
}
return numbers.ToString();
}
This uses a state machine to treat incoming characters differently depending on whether they are part of the first number, second number, or the rest of the line, and output characters accordingly. Each character is only copied once into the output, then I believe once more when the output is converted to a string at the end. This second conversion could probably be avoided by using a char[] for the output.
The bottleneck in this code seems to be the number of calls to StringBuilder.Append(). If more speed were required, I would first attempt to keep track of how many characters were to be copied directly into the output, then use .Append(string value, int startIndex, int count) to send an entire number across in one call.
I put a few example solutions into a test harness, and ran them on a string containing 300,000 coordinate-pair lines, averaged over 50 runs. The results on my PC were:
String Split, Replace each line (see Olivier's answer, though I pre-allocated the space in the StringBuilder):
6542 ms / 13493147 ticks, 130.84ms / 269862.9 ticks per conversion
Replace & Trim entire string (see Heriberto's second version):
3352 ms / 6914604 ticks, 67.04 ms / 138292.1 ticks per conversion
- Note: Original test was done with 900000 coord pairs, but this entire-string version suffered an out of memory exception so I had to rein it in a bit.
Split and Join (see Łukasz's answer):
8780 ms / 18110672 ticks, 175.6 ms / 362213.4 ticks per conversion
Character state machine (see above):
1685 ms / 3475506 ticks, 33.7 ms / 69510.12 ticks per conversion
So, the question of which version is most efficient comes down to: what are your requirements?
Your solution is fine. Maybe you could write it a bit more elegant like this:
string[] coordinatesVal = coordinateTxt.Trim().Split(new string[] { ",0" },
StringSplitOptions.RemoveEmptyEntries);
string result = string.Empty;
foreach (string line in coordinatesVal)
{
string[] numbers = line.Trim().Split(',');
result += numbers[0] + " " + numbers[1] + ", ";
}
result = result.Remove(result.Count()-2, 2);
Note the StringSplitOptions.RemoveEmptyEntries parameter of Split method so you don't have to deal with empty lines into foreach block.
Or you can do extremely short one-liner. Harder to debug, but in simple cases does the work.
string result =
string.Join(", ",
coordinateTxt.Trim().Split(new string[] { ",0" }, StringSplitOptions.RemoveEmptyEntries).
Select(i => i.Replace(",", " ")));
heres another way without defining your own loops and replace methods, or using LINQ.
string coordinateTxt = #" -82.9494547,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0";
string[] coordinatesVal = coordinateTxt.Replace(",", "*").Trim().Split(new string[] { "*0", Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(",", coordinatesVal).Replace("*", " ");
Console.WriteLine(result);
or even
string coordinateTxt = #" -82.9494540,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0";
string result = coordinateTxt.Replace(Environment.NewLine, "").Replace($",", " ").Replace(" 0", ", ").Trim(new char[]{ ',',' ' });
Console.WriteLine(result);
I'm working on a project that involves taking large text files and parsing each line. The point is to parse the whole text file into cells, much like an Excel spreadsheet. Unfortunately, there are no delimiters for most of the files, so I need some sort of index-based method to manually create the cells, even if the column is blank.
Previously, lines were parsed by splitting on null, which worked well. However, new data has made this method unreliable due to its not including blank cells, so I had to make a new method of parsing lines, which uses Substring. The method takes in an array of integers indices and splits the strings on the given indices:
private string[] SetCols3(int[] fixedWidthValues, string line)
{
{
string[] cols = new string[fixedWidthValues.Length];
int columnLength;
int FWV;
int FWV2;
bool lastOfFWV;
bool outOfBounds;
for (int x = 0; x < fixedWidthValues.Length; x++)
{
FWV = fixedWidthValues[x];
lastOfFWV = x + 1 >= fixedWidthValues.Length;
outOfBounds = lastOfFWV ? true : fixedWidthValues[x + 1] >= line.Length;
FWV2 = lastOfFWV || outOfBounds ? line.Length : fixedWidthValues[x + 1];
columnLength = FWV2 - FWV;
columnLength *= columnLength < 0 ? -1 : 1;
if (FWV < line.Length)
{
cols[x] = line.Substring(FWV, columnLength).Trim();
}
}
return cols;
}
Quick breakdown of the code: the integers and booleans are just to handle blank columns, lines that are shorter than normal, etc., and to make the code cleaner for other people to understand a little better (as opposed to one long, convoluted if statement).
My question: is there a way to make this more efficient? For some reason, this method takes significantly longer than the previous method. I understand it does more, so more time was expected. However, the difference is surprisingly huge. One iteration (with 15 indices) takes around 0.07 seconds (which is huge considering this method gets called several thousands time per file), compared to 0.00002 seconds on the high end for the method that splits on null. Is there something I can change in my code to noticeably increase its efficiency? I haven't been able to find anything particularly useful after hours of searching online.
Also, the number of indices/columns greatly affects the speed. For 15 columns, it takes around 0.07 seconds compared to 0.05 for 10 columns.
First,
outOfBounds = lastOfFWV ? true : fixedWidthValues[x + 1] >= line.Length;
could be changed to
outOfBounds = lastOfFWV || fixedWidthValues[x + 1] >= line.Length;
Next,
columnLength = FWV2 - FWV;
columnLength *= columnLength < 0 ? -1 : 1;
could be changed to
columnLength = Math.Abs(FWV2 - FWV);
And last,
if (FWV < line.Length)
{
could be moved to just after the FWV assignment at the top of the loop and changed to
if (FWV < line.Length)
continue;
But, I don't think any of these changes would make a significant impact on speed. Possibly more impact would be gained by changing what's passed in. Instead of passing in the column starting positions and calculating the column widths for each line, which won't change, pass in the starting positions and column widths. This way there's no calculation involved.
But rather than guessing, it'd be best to profile the method to find the hot spot(s).
The issue was two stray .ToInt32() calls I accidentally included (I don't know why they were there). This particular method was a different method, one from my company, than the Convert.ToInt32(), and for some reason it was majorly inefficient when trying to convert numbers. For reference, the issues was on the following lines as follows:
FWV = fixedWidthValues[x].ToInt32();
...
FWV2 = lastOfFWV || outOfBounds ? line.Length : fixedWidthValues[x + 1].ToInt32();
Removing them increased the efficiency by 60 times...
I understand the benefits of StringBuilder.
But if I want to concatenate 2 strings, then I assume that it is better (faster) to do it without StringBuilder. Is this correct?
At what point (number of strings) does it become better to use StringBuilder?
I warmly suggest you to read The Sad Tragedy of Micro-Optimization Theater, by Jeff Atwood.
It treats Simple Concatenation vs. StringBuilder vs. other methods.
Now, if you want to see some numbers and graphs, follow the link ;)
But if I want to concatinate 2
strings, then I assume that it is
better (faster) to do it without
StringBuilder. Is this correct?
That is indeed correct, you can find why exactly explained very well on :
Article about strings and StringBuilder
Summed up : if you can concatinate strings in one go like
var result = a + " " + b + " " + c + ..
you are better off without StringBuilder for only on copy is made (the length of the resulting string is calculated beforehand.);
For structure like
var result = a;
result += " ";
result += b;
result += " ";
result += c;
..
new objects are created each time, so there you should consider StringBuilder.
At the end the article sums up these rules of thumb :
Rules Of Thumb
So, when should you use StringBuilder,
and when should you use the string
concatenation operators?
Definitely use StringBuilder when
you're concatenating in a non-trivial
loop - especially if you don't know
for sure (at compile time) how many
iterations you'll make through the
loop. For example, reading a file a
character at a time, building up a
string as you go using the += operator
is potentially performance suicide.
Definitely use the concatenation
operator when you can (readably)
specify everything which needs to be
concatenated in one statement. (If you
have an array of things to
concatenate, consider calling
String.Concat explicitly - or
String.Join if you need a delimiter.)
Don't be afraid to break literals up
into several concatenated bits - the
result will be the same. You can aid
readability by breaking a long literal
into several lines, for instance, with
no harm to performance.
If you need the intermediate results
of the concatenation for something
other than feeding the next iteration
of concatenation, StringBuilder isn't
going to help you. For instance, if
you build up a full name from a first
name and a last name, and then add a
third piece of information (the
nickname, maybe) to the end, you'll
only benefit from using StringBuilder
if you don't need the (first name +
last name) string for other purpose
(as we do in the example which creates
a Person object).
If you just have a few concatenations
to do, and you really want to do them
in separate statements, it doesn't
really matter which way you go. Which
way is more efficient will depend on
the number of concatenations the sizes
of string involved, and what order
they're concatenated in. If you really
believe that piece of code to be a
performance bottleneck, profile or
benchmark it both ways.
System.String is an immutable object - it means that whenever you modify its content it will allocate a new string and this takes time (and memory?).
Using StringBuilder you modify the actual content of the object without allocating a new one.
So use StringBuilder when you need to do many modifications on the string.
Not really...you should use StringBuilder if you concatenate large strings or you have many concatenations, like in a loop.
If you concatenate strings in a loop, you should consider using StringBuilder instead of regular String
In case it's single concatenation, you may not see the difference in execution time at all
Here is a simple test app to prove the point:
static void Main(string[] args)
{
//warm-up rounds:
Test(500);
Test(500);
//test rounds:
Test(500);
Test(1000);
Test(10000);
Test(50000);
Test(100000);
Console.ReadLine();
}
private static void Test(int iterations)
{
int testLength = iterations;
Console.WriteLine($"----{iterations}----");
//TEST 1 - String
var startTime = DateTime.Now;
var resultString = "test string";
for (var i = 0; i < testLength; i++)
{
resultString += i.ToString();
}
Console.WriteLine($"STR: {(DateTime.Now - startTime).TotalMilliseconds}");
//TEST 2 - StringBuilder
startTime = DateTime.Now;
var stringBuilder = new StringBuilder("test string");
for (var i = 0; i < testLength; i++)
{
stringBuilder.Append(i.ToString());
}
string resultString2 = stringBuilder.ToString();
Console.WriteLine($"StringBuilder: {(DateTime.Now - startTime).TotalMilliseconds}");
Console.WriteLine("---------------");
Console.WriteLine("");
}
Results (in milliseconds):
----500----
STR: 0.1254
StringBuilder: 0
---------------
----1000----
STR: 2.0232
StringBuilder: 0
---------------
----10000----
STR: 28.9963
StringBuilder: 0.9986
---------------
----50000----
STR: 1019.2592
StringBuilder: 4.0079
---------------
----100000----
STR: 11442.9467
StringBuilder: 10.0363
---------------
There's no definitive answer, only rules-of-thumb. My own personal rules go something like this:
If concatenating in a loop, always use a StringBuilder.
If the strings are large, always use a StringBuilder.
If the concatenation code is tidy and readable on the screen then it's probably ok.
If it isn't, use a StringBuilder.
To paraphrase
Then shalt thou count to three, no more, no less. Three shall be the number thou shalt count, and the number of the counting shall be three. Four shalt thou not count, neither count thou two, excepting that thou then proceed to three. Once the number three, being the third number, be reached, then lobbest thou thy Holy Hand Grenade of Antioch
I generally use string builder for any block of code which would result in the concatenation of three or more strings.
Since it's difficult to find an explanation for this that's not either influenced by opinions or followed by a battle of prides I thought to write a bit of code on LINQpad to test this myself.
I found that using small sized strings rather than using i.ToString() changes response times (visible in small loops).
The test uses different sequences of iterations to keep time measurements in sensibly comparable ranges.
I'll copy the code at the end so you can try it yourself (results.Charts...Dump() won't work outside LINQPad).
Output (X-Axis: Number of iterations tested, Y-Axis: Time taken in ticks):
Iterations sequence: 2, 3, 4, 5, 6, 7, 8, 9, 10
Iterations sequence: 10, 20, 30, 40, 50, 60, 70, 80
Iterations sequence: 100, 200, 300, 400, 500
Code (Written using LINQPad 5):
void Main()
{
Test(2, 3, 4, 5, 6, 7, 8, 9, 10);
Test(10, 20, 30, 40, 50, 60, 70, 80);
Test(100, 200, 300, 400, 500);
}
void Test(params int[] iterationsCounts)
{
$"Iterations sequence: {string.Join(", ", iterationsCounts)}".Dump();
int testStringLength = 10;
RandomStringGenerator.Setup(testStringLength);
var sw = new System.Diagnostics.Stopwatch();
var results = new Dictionary<int, TimeSpan[]>();
// This call before starting to measure time removes initial overhead from first measurement
RandomStringGenerator.GetRandomString();
foreach (var iterationsCount in iterationsCounts)
{
TimeSpan elapsedForString, elapsedForSb;
// string
sw.Restart();
var str = string.Empty;
for (int i = 0; i < iterationsCount; i++)
{
str += RandomStringGenerator.GetRandomString();
}
sw.Stop();
elapsedForString = sw.Elapsed;
// string builder
sw.Restart();
var sb = new StringBuilder(string.Empty);
for (int i = 0; i < iterationsCount; i++)
{
sb.Append(RandomStringGenerator.GetRandomString());
}
sw.Stop();
elapsedForSb = sw.Elapsed;
results.Add(iterationsCount, new TimeSpan[] { elapsedForString, elapsedForSb });
}
// Results
results.Chart(r => r.Key)
.AddYSeries(r => r.Value[0].Ticks, LINQPad.Util.SeriesType.Line, "String")
.AddYSeries(r => r.Value[1].Ticks, LINQPad.Util.SeriesType.Line, "String Builder")
.DumpInline();
}
static class RandomStringGenerator
{
static Random r;
static string[] strings;
public static void Setup(int testStringLength)
{
r = new Random(DateTime.Now.Millisecond);
strings = new string[10];
for (int i = 0; i < strings.Length; i++)
{
strings[i] = Guid.NewGuid().ToString().Substring(0, testStringLength);
}
}
public static string GetRandomString()
{
var indx = r.Next(0, strings.Length);
return strings[indx];
}
}
But if I want to concatenate 2 strings, then I assume that it's better and faster to do so without StringBuilder. Is this correct?
Yes. But more importantly, it is vastly more readable to use a vanilla String in such situations. Using it in a loop, on the other hand, makes sense and can also be as readable as concatenation.
I’d be wary of rules of thumb that cite specific numbers of concatenation as a threshold. Using it in loops (and loops only) is probably just as useful, easier to remember and makes more sense.
As long as you can physically type the number of concatenations (a + b + c ...) it shouldn't make a big difference. N squared (at N = 10) is a 100X slowdown, which shouldn't be too bad.
The big problem is when you are concatenating hundreds of strings. At N=100, you get a 10000X times slowdown. Which is pretty bad.
A single concatenation is not worth using a StringBuilder. I've typically used 5 concatenations as a rule of thumb.
I don't think there's a fine line between when to use or when not to. Unless of course someone performed some extensive testings to come out with the golden conditions.
For me, I will not use StringBuilder if just concatenating 2 huge strings. If there's loop with an undeterministic count, I'm likely to, even if the loop might be small counts.
How many parameters can you pass to a string.Format() method?
There must be some sort of theoretical or enforced limit on it. Is it based on the limits of the params[] type or the memory usage of the app that is using it or something else entirely?
OK, I emerge from hiding... I used the following program to verify what was going on and while Marc pointed out that a string like this "{0}{1}{2}...{2147483647}" would succeed the memory limit of 2 GiB before the argument list, my findings did't match yours. Thus the hard limit, of the number of parameters you can put in a string.Format method call has to be 107713904.
int i = 0;
long sum = 0;
while (sum < int.MaxValue)
{
var s = sizeof(char) * ("{" + i + "}").Length;
sum += s; // pseudo append
++i;
}
Console.WriteLine(i);
Console.ReadLine();
Love the discussion people!
Not as far as I know...
well, the theoretical limit would be the int32 limit for the array, but you'd hit the string length limit long before that, I guess...
Just don't go mad with it ;-p It may be better to write lots of small fragments to (for example) a file or response, than one huge hit.
edit - it looked like there was a limit in the IL (0xf4240), but apparently this isn't quite as it appears; I can make it get quite large (2^24) before I simply run out of system memory...
Update; it seems to me that the bounding point is the format string... those {1000001}{1000002} add up... a quick bit of math (below) shows that the maximum useful number of arguments we can use is 206,449,129:
long remaining = 2147483647;// max theoretical format arg length
long count = 10; // i.e. {0}-{9}
long len = 1;
int total = 0;
while (remaining >= 0) {
for(int i = 0 ; i < count && remaining >= 0; i++) {
total++;
remaining -= len + 2; // allow for {}
}
count *= 10;
len++;
}
Console.WriteLine(total - 1);
Expanding on Marc's detailed answer.
The only other limitation that is important is for the debugger. Once you pass a certain number of parameters directly to a function, the debugger becomes less functional in that method. I believe the limit is 64 parameters.
Note: This does not mean an array with 64 members, but 64 parameters passed directly to the function.
You might laugh and say "who would do this?" which is certainly a valid question. Yet LINQ makes this a lot easier than you think. Under the hood in LINQ the compiler generates a lot of code. It's possible that for a large generate SQL query where more than 64 fields are selected that you would hit this issue. Because the compiler under the hood would need to pass all of the fields to the constructor of an anonymous type.
Still a corner case.
Considering that both the limit of the Array class and the String class are the upper limit of Int32 (documented at 2,147,483,647 here: Int32 Structure), it is reasonable to believe that this value is the limit of the number string of format parameters.
Update Upon checking reflector, John is right. String.Format, using the Red Gate Reflector, shows the ff:
public static string Format(IFormatProvider provider, string format, params object[] args)
{
if ((format == null) || (args == null))
{
throw new ArgumentNullException((format == null) ? "format" : "args");
}
StringBuilder builder = new StringBuilder(format.Length + (args.Length * 8));
builder.AppendFormat(provider, format, args);
return builder.ToString();
}
The format.Length + (args.Length * 8) part of the code is enough to kill most of that number. Ergo, '2,147,483,647 = x + 8x' leaves us with x = 238,609,294 (theoretical).
It's far less than that of course; as the guys in the comments mentioned the string hitting the string length limit earlier is quite likely.
Maybe someone should just code this into a machine problem! :P
Any thoughts on the efficiency of this? ...
CommentText.ToCharArray().Where(c => c >= 'A' && c <= 'Z').Count()
Ok, just knocked up some code to time your method against this:
int count = 0;
for (int i = 0; i < s.Length; i++)
{
if (char.IsUpper(s[i])) count++;
}
The result:
Yours: 19737 ticks
Mine: 118 ticks
Pretty big difference! Sometimes the most straight-forward way is the most efficient.
Edit
Just out of interest, this:
int count = s.Count(c => char.IsUpper(c));
Comes in at at around 2500 ticks. So for a "Linqy" one-liner it's pretty quick.
First there is no reason you need to call ToCharArray() since, assuming CommentText is a string it is already an IEnumerable<char>. Second, you should probably be calling char.IsUpper instead of assuming you are only dealing with ASCII values. The code should really look like,
CommentText.Count(char.IsUpper)
Third, if you are worried about speed there isn't much that can beat the old for loop,
int count = 0;
for (int i = 0; i < CommentText.Length; i++)
if (char.IsUpper(CommentText[i]) count++;
In general, calling any method is going to be slower than inlining the code but this kind of optimization should only be done if you are absolutely sure this is the bottle-neck in your code.
You're only counting standard ASCII and not ÃÐÊ etc.
How about
CommentText.ToCharArray().Where(c => Char.IsUpper(c)).Count()
Without even testing I'd say
int count = 0;
foreach (char c in commentText)
{
if (Char.IsUpper(c))
count++;
}
is faster, off now to test it.
What you are doing with that code is to create a collection with the characters, then create a new collection containing only the uppercase characters, then loop through that collection only to find out how many there are.
This will perform better (but still not quite as good as a plain loop), as it doesn't create the intermediate collections:
CommentText.Count(c => Char.IsUpper(c))
Edit: Removed the ToCharArray call also, as Matt suggested.
I've got this
Regex x = new Regex("[A-Z]{1}",
RegexOptions.Compiled | RegexOptions.CultureInvariant);
int c = x.Matches(s).Count;
but I don't know if its particularly quick. It won't get special chars either, I s'pose
EDIT:
Quick comparison to this question's answer. Debug in vshost, 10'000 iterations with the string:
aBcDeFGHi1287jKK6437628asghwHllmTbynerA
The answer: 20-30 ms
The regex solution: 140-170 ms