I'm working on a project that involves taking large text files and parsing each line. The point is to parse the whole text file into cells, much like an Excel spreadsheet. Unfortunately, there are no delimiters for most of the files, so I need some sort of index-based method to manually create the cells, even if the column is blank.
Previously, lines were parsed by splitting on null, which worked well. However, new data has made this method unreliable because it doesn't account for blank cells, so I wrote a new line-parsing method that uses Substring. The method takes an array of integer indices and splits the string at those indices:
private string[] SetCols3(int[] fixedWidthValues, string line)
{
    string[] cols = new string[fixedWidthValues.Length];
    int columnLength;
    int FWV;
    int FWV2;
    bool lastOfFWV;
    bool outOfBounds;
    for (int x = 0; x < fixedWidthValues.Length; x++)
    {
        FWV = fixedWidthValues[x];
        lastOfFWV = x + 1 >= fixedWidthValues.Length;
        outOfBounds = lastOfFWV ? true : fixedWidthValues[x + 1] >= line.Length;
        FWV2 = lastOfFWV || outOfBounds ? line.Length : fixedWidthValues[x + 1];
        columnLength = FWV2 - FWV;
        columnLength *= columnLength < 0 ? -1 : 1;
        if (FWV < line.Length)
        {
            cols[x] = line.Substring(FWV, columnLength).Trim();
        }
    }
    return cols;
}
Quick breakdown of the code: the integers and booleans are just there to handle blank columns, lines that are shorter than normal, etc., and to keep the code easier for other people to follow (as opposed to one long, convoluted if statement).
My question: is there a way to make this more efficient? For some reason, this method takes significantly longer than the previous method. I understand it does more, so more time was expected. However, the difference is surprisingly huge. One iteration (with 15 indices) takes around 0.07 seconds (which is huge considering this method gets called several thousand times per file), compared to 0.00002 seconds on the high end for the method that splits on null. Is there something I can change in my code to noticeably increase its efficiency? I haven't been able to find anything particularly useful after hours of searching online.
Also, the number of indices/columns greatly affects the speed. For 15 columns, it takes around 0.07 seconds compared to 0.05 for 10 columns.
First,
outOfBounds = lastOfFWV ? true : fixedWidthValues[x + 1] >= line.Length;
could be changed to
outOfBounds = lastOfFWV || fixedWidthValues[x + 1] >= line.Length;
Next,
columnLength = FWV2 - FWV;
columnLength *= columnLength < 0 ? -1 : 1;
could be changed to
columnLength = Math.Abs(FWV2 - FWV);
And last,
if (FWV < line.Length)
{
could be moved to just after the FWV assignment at the top of the loop and changed to
if (FWV >= line.Length)
continue;
But I don't think any of these changes would make a significant impact on speed. Possibly more impact would be gained by changing what's passed in. Instead of passing in only the column starting positions and recalculating the column widths for every line (the widths never change), pass in both the starting positions and the column widths. That way there's no per-line calculation at all.
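For illustration, here's a minimal sketch of that variant (the method and parameter names are mine, not from the original code):

// Hypothetical variant: the caller computes the widths once per file,
// so the per-line work is just a bounds check and a Substring call.
private static string[] SetColsByWidth(int[] starts, int[] widths, string line)
{
    string[] cols = new string[starts.Length];
    for (int x = 0; x < starts.Length; x++)
    {
        int start = starts[x];
        if (start >= line.Length)
            continue; // short line: leave the remaining cells null/blank

        int length = Math.Min(widths[x], line.Length - start); // clamp to the end of the line
        cols[x] = line.Substring(start, length).Trim();
    }
    return cols;
}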
But rather than guessing, it'd be best to profile the method to find the hot spot(s).
The issue was two stray .ToInt32() calls I accidentally included (I don't know why they were there). This .ToInt32() was a different method, one from my company's codebase rather than Convert.ToInt32(), and for some reason it was hugely inefficient at converting numbers. For reference, the offending lines were:
FWV = fixedWidthValues[x].ToInt32();
...
FWV2 = lastOfFWV || outOfBounds ? line.Length : fixedWidthValues[x + 1].ToInt32();
Removing them increased the efficiency by 60 times...
Related
I have recently come across this code, and what it does is go through an array and return the value that is seen most often. For example, given 1,1,2,1,3 it returns 1 because 1 appears more often than 2 or 3. What I am trying to do is understand how it works, so I stepped through it in Visual Studio, but it is not ringing any bells.
Can anyone help me understand what is going on here? It would be a total plus if someone could tell me what c does and what the logic is behind the arguments in the if statements.
int[] arr = a;
int c = 1, maxcount = 1, maxvalue = 0;
int result = 0;
for (int i = 0; i < arr.Length; i++)
{
    maxvalue = arr[i];
    for (int j = 0; j < arr.Length; j++)
    {
        if (maxvalue == arr[j] && j != i)
        {
            c++;
            if (c > maxcount)
            {
                maxcount = c;
                result = arr[i];
            }
        }
        else
        {
            c = 1;
        }
    }
}
return result;
EDIT: On closer examination, the code snippet has a nested loop and is conventionally counting; it simply keeps track of the highest count seen so far and the element it belongs to, keeping the two in sync.
That looks like an implementation of the Boyer-Moore majority vote counting algorithm. They have a nice illustration here.
The logic is simple, and is to compute the majority in a single pass, taking O(n) time. Note that majority here means that more than 50% of the array must be filled with that element. If there is no majority element, you get an "incorrect" result.
Verifying if the element is actually forming a majority is done in a separate pass usually.
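For comparison, a minimal sketch of the majority-vote idea in C# (my own illustration, not the posted code) looks like this:

// Boyer-Moore majority vote: a single pass with O(1) extra space.
// It only returns a meaningful answer if some value fills more than half
// the array; a second pass is normally used to verify the candidate.
static int MajorityCandidate(int[] arr)
{
    int candidate = 0, count = 0;
    foreach (int value in arr)
    {
        if (count == 0) { candidate = value; count = 1; }
        else if (value == candidate) count++;
        else count--;
    }
    return candidate;
}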
It is not computing the most frequent element - what it is computing is the longest run of elements.
Also, it is not doing it very efficiently: the inner loop only needs to run up to i-1, not up to arr.Length.
c is keeping track of the current run length. The first "if" checks whether this is a continuous run. The second "if" (after reaching the last element in the run) checks whether this run is longer than any run you have seen so far.
In the above input sample, you are getting 1 as answer because it is the longest run. Try with an input where the element with the longest run is not the same as the most frequent element. (e.g., 2,1,1,1,3,2,3,2,3,2,3,2 - here 2 is the most frequent element, but 1,1,1 is the longest run).
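If what you actually want is the most frequent element, a dictionary-based count is the straightforward way (a rough sketch, assuming System.Collections.Generic is in scope, not the posted code):

// Counts every value and keeps the one seen most often.
// For 2,1,1,1,3,2,3,2,3,2,3,2 this returns 2, whereas the posted code reports 1.
static int MostFrequent(int[] arr)
{
    var counts = new Dictionary<int, int>();
    int best = arr[0], bestCount = 0;
    foreach (int value in arr)
    {
        counts.TryGetValue(value, out int c);
        counts[value] = ++c;
        if (c > bestCount) { bestCount = c; best = value; }
    }
    return best;
}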
I've always loved reducing the number of lines of code by using simple but smart math approaches, and this seems to be one of those situations. What I basically need is to sum up the digits in the odd and even places separately, with minimum code. So far this is the best way I have been able to think of:
string number = "123456789";
int sumOfDigitsInOddPlaces = 0;
int sumOfDigitsInEvenPlaces = 0;
for (int i = 0; i < number.Length; i++){
    if (i % 2 == 0) //Means odd ones
        sumOfDigitsInOddPlaces += number[i];
    else
        sumOfDigitsInEvenPlaces += number[i];
}
//The rest is not important
Do you have a better idea? Something without needing to use if else
int* sum[2] = {&sumOfDigitsInOddPlaces,&sumOfDigitsInEvenPlaces};
for (int i=0;i<number.length;i++)
{
*(sum[i&1])+=number[i];
}
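A C# take on the same trick might look like this (a sketch; it assumes you want the numeric digit values rather than the character codes):

string number = "123456789";
int[] sums = new int[2];              // sums[0]: indices 0,2,4..., sums[1]: indices 1,3,5...
for (int i = 0; i < number.Length; i++)
{
    sums[i & 1] += number[i] - '0';   // index with i & 1 instead of branching; - '0' converts the char digit to its value
}
int sumOfDigitsInOddPlaces = sums[0]; // the question calls indices 0,2,4... the "odd places"
int sumOfDigitsInEvenPlaces = sums[1];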
You could use two separate loops, one for the odd indexed digits and one for the even indexed digits.
Also, your modulus conditional may be wrong: you're placing the even-indexed digits (0, 2, 4...) in the odd accumulator. It could just be that you're treating the number as 1-based while the character array is 0-based (maybe what you intended), but for the algorithm's sake I will consider the number to be 0-based.
Here's my proposition
number = "123456789";
sumOfDigitsInOddPlaces = 0;
sumOfDigitsInEvenPlaces = 0;
//even digits
for (int i = 0; i < number.length; i = i + 2){
    sumOfDigitsInEvenPlaces += number[i];
}
//odd digits, note the start at j = 1
for (int j = 1; j < number.length; j = j + 2){
    sumOfDigitsInOddPlaces += number[j];
}
On the large scale this doesn't improve efficiency (it's still an O(N) algorithm), but it eliminates the branching.
Since you added C# to the question:
var numString = "123456789";
var digits = numString.Select(c => c.ToString()); // a string is already a sequence of chars
var odds = digits.Where((v, i) => i % 2 == 1);
var evens = digits.Where((v, i) => i % 2 == 0);
var sumOfOdds = odds.Select(int.Parse).Sum();
var sumOfEvens = evens.Select(int.Parse).Sum();
Do you like Python?
num_string = "123456789"
odds = sum(map(int, num_string[::2]))
evens = sum(map(int, num_string[1::2]))
This Java solution requires no if/else, has no code duplication and is O(N):
String number = "123456789";
int[] sums = new int[2]; //sums[0] == sum of even digits, sums[1] == sum of odd
for(int arrayIndex=0; arrayIndex < 2; ++arrayIndex)
{
for (int i=0; i < number.length()-arrayIndex; i += 2)
{
sums[arrayIndex] += Character.getNumericValue(number.charAt(i+arrayIndex));
}
}
Assuming number.length is even, it is quite simple. The corner case is then to handle the last element when number.length is odd.
int i=0;
while(i<number.length-1)
{
sumOfDigitsInEvenPlaces += number[ i++ ];
sumOfDigitsInOddPlaces += number[ i++ ];
}
if( i < number.length )
sumOfDigitsInEvenPlaces += number[ i ];
Because the loop advances i two at a time, if number.length is even, subtracting 1 from the bound changes nothing.
If number.length is odd, it leaves out the last item.
If number.length is odd, then the value of i on exiting the loop is the index of the not-yet-visited last element.
If number.length is odd, by modulo-2 reasoning, you have to add the last item to sumOfDigitsInEvenPlaces.
This seems slightly more verbose, but also more readable, to me than Anonymous' (accepted) answer. However, benchmarks to come.
Well, the compiler seems to find my code more understandable as well, since it removes it all if I don't print the results (which explains why I kept getting a time of 0 all along...). The other code, though, is obfuscated enough even for the compiler.
In the end, even with huge arrays, it's pretty hard for clock_t to tell the difference between the two. You get about a third fewer instructions in the second case, but since everything's in cache (and your running sums even in registers) it doesn't matter much.
For the curious, I've put the disassembly of both versions (compiled from C) here : http://pastebin.com/2fciLEMw
I've searched online for a diff algorithm but none of them do what I am looking for. It is for a texting contest (as in cell phone) and I need the entry text compared to the master text recording the errors along the way. I am semi-new to C# and I get most of the string functions and didn't think this was going to be that hard of a problem, but alas I just can't wrap my head around it.
I have a form with 2 rich-text boxes (one on top of the other) and 2 buttons. The top box is the master text (string) and the bottom box is the entry text (string). Every contestant sends a text to an email account; from the email we copy and paste the text into the Entry RTB and compare it to the Master RTB. Each single word and each single space counts as a thing to check. A word, no matter how many errors it has, is still 1 error. And for every error, add 1 sec. to their time.
Examples:
Hello there! <= 3 checks (2 words and 1 space)
Helothere! <= 2 errors (Helo and space)
Hello there!! <= 1 error (extra ! at end of there!)
Hello there! How are you? <= 9 checks (5 words and 4 spaces)
Helothere!! How a re you? <= still 9 checks, 4 errors (Helo, no space, extra !, and a space in are)
Hello there!# Ho are yu?? <= 3 errors (# at end of there!, no w, no o and extra ?; multiple errors in the same word still count as 1)
What I have so far:
I've created 6 arrays (3 for master, 3 for entry) and they are
CharArray of all chars
StringArray of all strings(words) including the spaces
IntArray with length of the string in each StringArray
My biggest trouble is if the entry text is wrong and it's shorter or longer than the master. I keep getting IndexOutOfRange exceptions (understandably) but can't fathom how to go about checking and writing the code to compensate.
I hope I have made myself clear enough as to what I need help with. If anyone could give some code examples or something to point me in the right direction, it would be very helpful.
Have you looked into the Levenshtein distance algorithm? It returns the number of differences between two strings, which, in your case would be texting errors. Implementing the algorithm based off the pseudo-code found on the wikipedia page passes the first 3 of your 4 use cases:
Assert.AreEqual(2, LevenshteinDistance("Hello there!", "Helothere!"));
Assert.AreEqual(1, LevenshteinDistance("Hello there!", "Hello there!!"));
Assert.AreEqual(4, LevenshteinDistance("Hello there! How are you?", "Helothere!! How a re you?"));
Assert.AreEqual(3, LevenshteinDistance("Hello there! How are you?", "Hello there!# Ho are yu??")); //fails, returns 4 errors
So while not perfect out of the box, it is probably a good starting point for you. Also, if you have too much trouble implementing your scoring rules, it might be worth revisiting them.
hth
Update:
Here is the result of the string you requested in the comments:
Assert.AreEqual(7, LevenshteinDistance("Hello there! How are you?", "Hlothere!! Hw a reYou?")); //fails, returns 8 errors
And here is my implementation of the Levenshtein Distance algorithm:
int LevenshteinDistance(string left, string right)
{
if (left == null || right == null)
{
return -1;
}
if (left.Length == 0)
{
return right.Length;
}
if (right.Length == 0)
{
return left.Length;
}
int[,] distance = new int[left.Length + 1, right.Length + 1];
for (int i = 0; i <= left.Length; i++)
{
distance[i, 0] = i;
}
for (int j = 0; j <= right.Length; j++)
{
distance[0, j] = j;
}
for (int i = 1; i <= left.Length; i++)
{
for (int j = 1; j <= right.Length; j++)
{
if (right[j - 1] == left[i - 1])
{
distance[i, j] = distance[i - 1, j - 1];
}
else
{
distance[i, j] = Min(distance[i - 1, j] + 1, //deletion
distance[i, j - 1] + 1, //insertion
distance[i - 1, j - 1] + 1); //substitution
}
}
}
return distance[left.Length, right.Length];
}
int Min(int val1, int val2, int val3)
{
return Math.Min(val1, Math.Min(val2, val3));
}
You need to come up with a scoring system that works for your situation.
I would make a word array after each space.
If a word is found at the same index: +5.
If a word is found within +-1 of its index location: +3 (keep a counter of how many words differ, to adjust the +- correction).
If a needed word is found as part of another word: +2.
Etc., etc. Matching words is hard; coming up with a rules engine that works is 'easier'. A rough sketch of this idea follows.
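As a very rough sketch of that kind of rules engine (the weights are the ones suggested above; the method name and details are mine and untested):

static int ScoreEntry(string master, string entry)
{
    string[] masterWords = master.Split(' ');
    string[] entryWords = entry.Split(' ');
    int score = 0;
    for (int i = 0; i < masterWords.Length; i++)
    {
        bool atSameIndex = i < entryWords.Length && entryWords[i] == masterWords[i];
        bool oneOff = (i - 1 >= 0 && i - 1 < entryWords.Length && entryWords[i - 1] == masterWords[i])
                   || (i + 1 < entryWords.Length && entryWords[i + 1] == masterWords[i]);

        if (atSameIndex) score += 5;                          // word found at the same index
        else if (oneOff) score += 3;                          // word found one position away
        else if (entry.Contains(masterWords[i])) score += 2;  // word buried somewhere in the entry
    }
    return score;
}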
I once implemented an algorithm (which I can't find at the moment, I'll post code when I find it) which looked at the total number of PAIRS in the target string. i.e. "Hello, World!" would have 11 pairs, { "He", "el", "ll",...,"ld", "d!" }.
You then do the same thing on an input string such as "Helo World" so you have { "He",...,"ld" }.
You can then calculate accuracy as a function of correct pairs (i.e. input pairs that are in the list of target pairs) and incorrect pairs (i.e. input pairs that do not exist in the list of target pairs), compared to the total list of target pairs. Over long enough sentences, this measure would be quite fair.
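A rough sketch of the idea looks something like this (illustrative only, not the original implementation; duplicate pairs are ignored here, which a real scorer might want to handle):

// Collect the adjacent character pairs of the master text, then measure what
// fraction of the entry's pairs appear in that set.
static double PairAccuracy(string master, string entry)
{
    var masterPairs = new HashSet<string>();
    for (int i = 0; i < master.Length - 1; i++)
        masterPairs.Add(master.Substring(i, 2));

    int correct = 0, total = 0;
    for (int i = 0; i < entry.Length - 1; i++)
    {
        total++;
        if (masterPairs.Contains(entry.Substring(i, 2)))
            correct++;
    }
    return total == 0 ? 0.0 : (double)correct / total;
}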
A simple algorithm would be to check letter by letter. If the letters differ, increment the number of errors. If the next pairing of letters matches, it's a switched letter, so just continue. If the mistyped letter matches the next master letter, it is an omission; treat it accordingly. If the next letter matches the mistyped one, it's an insertion; treat it accordingly. Otherwise the person really messed up, so continue.
This doesn't get everything but with a few modifications this could become comprehensive.
a weak attempt at pseudocode:
edit: new idea. look at comments. I don't know the string functions off the top of my head so you'll have to figure that part out. The algorithm kinda fails for words that repeat a lot though...
string entry; //we'll pretend that this has stuff inside
string master; // this too...
string tempentry = entry; //stuff will be deleted so I need a copy to mess up
int e =0; //word index for entry
int m = 0; //word index for master
int errors = 0;
while(there are words in tempentry) //!tempentry.empty() ?
string mword = the next word in master;
m++;
int eplace = find mword in tempentry; //eplace is the index of where the mword starts in tempentry
if(eplace == -1) //word not there...
continue;
else
errors += m - e;
errors += find number of spaces before eplace
e = m // there is an error
tempentry = stripoff everything between the beginning and the next word// substring?
all words and spaces left in master are considered errors.
There are a couple of bounds-checking errors that need to be fixed here, but it's a good start.
I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:
values = line.Split(delimiter);
where line is a string that holds the values separated by the delimiter.
Measuring the performance of my ReadNextRow method, I noticed that it spends 66% of its time in String.Split, so I was wondering if someone knows of a faster method to do this.
Thanks!
The BCL implementation of string.Split is actually quite fast; I've done some testing here trying to outperform it and it's not easy.
But there's one thing you can do and that's to implement this as a generator:
public static IEnumerable<string> GetSplit( this string s, char c )
{
int l = s.Length;
int i = 0, j = s.IndexOf( c, 0, l );
if ( j == -1 ) // No such substring
{
yield return s; // Return original and break
yield break;
}
while ( j != -1 )
{
if ( j - i > 0 ) // Non empty?
{
yield return s.Substring( i, j - i ); // Return non-empty match
}
i = j + 1;
j = s.IndexOf( c, i, l - i );
}
if ( i < l ) // Has remainder?
{
yield return s.Substring( i, l - i ); // Return remaining trail
}
}
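For example, assuming GetSplit is defined as an extension method in a static class as above, usage looks like this:

// Fields are produced lazily, one at a time; note that this version skips
// empty segments (the j - i > 0 check above), so "a,b,,c" yields three items.
foreach (string field in "a,b,,c".GetSplit(','))
{
    Console.WriteLine(field);
}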
The above method is not necessarily faster than string.Split for small strings, but it returns results as it finds them; this is the power of lazy evaluation. If you have long lines or need to conserve memory, this is the way to go.
The above method is bounded by the performance of IndexOf and Substring, which do too much index-out-of-range checking; to be faster you need to optimize those away and implement your own helper methods. You can beat string.Split's performance, but it's gonna take clever int-hacking. You can read my post about that here.
It should be pointed out that split() is a questionable approach for parsing CSV files in case you come across commas in the file, e.g.:
1,"Something, with a comma",2,3
The other thing I'll point out, without knowing how you profiled, is to be careful about profiling this kind of low-level detail. The granularity of the Windows/PC timer might come into play and you may have a significant overhead in just looping, so use some sort of control value.
That being said, split() (at least the Java version) is built to handle regular expressions, which are obviously more complex than you need (and the wrong tool to deal with escaped commas anyway). Also, split() creates lots of temporary objects.
So if you want to speed it up (and I have trouble believing that performance of this part is really an issue) then you want to do it by hand and you want to reuse your buffer objects so you're not constantly creating objects and giving the garbage collector work to do in cleaning them up.
The algorithm for that is relatively simple:
Stop at every comma;
When you hit quotes continue until you hit the next set of quotes;
Handle escaped quotes (i.e. \") and arguably escaped commas (\,).
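A bare-bones sketch of that hand-parsing loop might look like the following (illustrative only; it assumes System.Text and System.Collections.Generic, toggles on quotes, and does not handle escaped quotes or reuse buffers):

// Walk the line once: split on commas, but treat commas inside double quotes as data.
static List<string> ParseCsvLine(string line)
{
    var fields = new List<string>();
    var current = new StringBuilder();
    bool inQuotes = false;
    foreach (char ch in line)
    {
        if (ch == '"')
            inQuotes = !inQuotes;              // enter/leave a quoted section
        else if (ch == ',' && !inQuotes)
        {
            fields.Add(current.ToString());    // end of field
            current.Clear();
        }
        else
            current.Append(ch);
    }
    fields.Add(current.ToString());            // last field
    return fields;
}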
Oh and to give you some idea of the cost of regex, there was a question (Java not C# but the principle was the same) where someone wanted to replace every n-th character with a string. I suggested using replaceAll() on String. Jon Skeet manually coded the loop. Out of curiosity I compared the two versions and his was an order of magnitude better.
So if you really want performance, it's time to hand parse.
Or, better yet, use someone else's optimized solution like this fast CSV reader.
By the way, while this is in relation to Java it concerns the performance of regular expressions in general (which is universal) and replaceAll() vs a hand-coded loop: Putting char into a java string for each N characters.
Here's a very basic example using ReadOnlySpan. On my machine this takes around 150ns as opposed to string.Split() which takes around 250ns. That's a nice 40% improvement right there.
string serialized = "1577836800;1000;1";
ReadOnlySpan<char> span = serialized.AsSpan();
Trade result = new Trade();
int index = span.IndexOf(';');
result.UnixTimestamp = long.Parse(span.Slice(0, index));
span = span.Slice(index + 1);
index = span.IndexOf(';');
result.Price = float.Parse(span.Slice(0, index));
span = span.Slice(index + 1);
index = span.IndexOf(';');
result.Quantity = float.Parse(span.Slice(0, index));
return result;
Note that a ReadOnlySpan.Split() will soon be part of the framework. See
https://github.com/dotnet/runtime/pull/295
Depending on use, you can speed this up by using Pattern.split instead of String.split. If you have this code in a loop (which I assume you probably do since it sounds like you are parsing lines from a file) String.split(String regex) will call Pattern.compile on your regex string every time that statement of the loop executes. To optimize this, Pattern.compile the pattern once outside the loop and then use Pattern.split, passing the line you want to split, inside the loop.
Hope this helps
I found this implementation, which is 30% faster, on Dejan Pelzel's blog. I quote from there:
The Solution
With this in mind, I set to create a string splitter that would use an internal buffer similarly to a StringBuilder. It uses very simple logic of going through the string and saving the value parts into the buffer as it goes along.
public int Split(string value, char separator)
{
int resultIndex = 0;
int startIndex = 0;
// Find the mid-parts
for (int i = 0; i < value.Length; i++)
{
if (value[i] == separator)
{
this.buffer[resultIndex] = value.Substring(startIndex, i - startIndex);
resultIndex++;
startIndex = i + 1;
}
}
// Find the last part
this.buffer[resultIndex] = value.Substring(startIndex, value.Length - startIndex);
resultIndex++;
return resultIndex;
}
How To Use
The StringSplitter class is incredibly simple to use, as you can see in the example below. Just be careful to reuse the StringSplitter object and not create a new instance of it in loops or for single-time use. In that case it would be better to just use the built-in String.Split.
var splitter = new StringSplitter(2);
splitter.Split("Hello World", ' ');
if (splitter.Results[0] == "Hello" && splitter.Results[1] == "World")
{
Console.WriteLine("It works!");
}
The Split method returns the number of items found, so you can easily iterate through the results like this:
var splitter = new StringSplitter(2);
var len = splitter.Split("Hello World", ' ');
for (int i = 0; i < len; i++)
{
Console.WriteLine(splitter.Results[i]);
}
This approach has advantages and disadvantages.
You might think that there are optimizations to be had, but the reality will be you'll pay for them elsewhere.
You could, for example, do the split 'yourself' and walk through all the characters and process each column as you encounter it, but you'd be copying all the parts of the string in the long run anyhow.
One of the optimizations we could do in C or C++, for example, is replace all the delimiters with '\0' characters, and keep pointers to the start of the column. Then, we wouldn't have to copy all of the string data just to get to a part of it. But this you can't do in C#, nor would you want to.
If there is a big difference between the number of columns that are in the source, and the number of columns that you need, walking the string manually may yield some benefit. But that benefit would cost you the time to develop it and maintain it.
I've been told that 90% of the CPU time is spent in 10% of the code. There are variations to this "truth". In my opinion, spending 66% of your time in Split is not that bad if processing CSV is the thing that your app needs to do.
Dave
Some very thorough analysis on String.Split() vs Regex and other methods.
We are talking ms savings over very large strings though.
The main problem(?) with String.Split is that it's general, in that it caters for many needs.
If you know more about your data than Split does, writing your own can be an improvement.
For instance, if:
You don't care about empty strings, so you don't need to handle those any special way
You don't need to trim strings, so you don't need to do anything with or around those
You don't need to check for quoted commas or quotes
You don't need to handle quotes at all
If any of these are true, you might see an improvement by writing your own more specific version of String.Split.
Having said that, the first question you should ask is whether this actually is a problem worth solving. Is the time taken to read and import the file so long that you actually feel this is a good use of your time? If not, then I would leave it alone.
The second question is why String.Split is using that much time compared to the rest of your code. If the answer is that the code is doing very little with the data, then I would probably not bother.
However, if, say, you're stuffing the data into a database, then 66% of the time of your code spent in String.Split constitutes a big big problem.
CSV parsing is actually fiendishly complex to get right, I used classes based on wrapping the ODBC Text driver the one and only time I had to do this.
The ODBC solution recommended above looks at first glance to be basically the same approach.
I thoroughly recommend you do some research on CSV parsing before you get too far down a path that nearly-but-not-quite works (all too common). The Excel thing of only double-quoting strings that need it is one of the trickiest to deal with in my experience.
As others have said, String.Split() will not always work well with CSV files. Consider a file that looks like this:
"First Name","Last Name","Address","Town","Postcode"
David,O'Leary,"12 Acacia Avenue",London,NW5 3DF
June,Robinson,"14, Abbey Court","Putney",SW6 4FG
Greg,Hampton,"",,
Stephen,James,"""Dunroamin"" 45 Bridge Street",Bristol,BS2 6TG
(e.g. inconsistent use of speechmarks, strings including commas and speechmarks, etc)
This CSV reading framework will deal with all of that, and is also very efficient:
LumenWorks.Framework.IO.Csv by Sebastien Lorien
This is my solution:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
Dim kwds(1) As String
Dim k = 0
Dim tmp As String = ""
For l = 1 To inputString.Length
tmp = Mid(inputString, l, 1)
If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
kwds(k) &= tmp
Next
Return kwds
End Function
Here is a version with benchmarking:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
Dim sw As New Stopwatch
sw.Start()
Dim kwds(1) As String
Dim k = 0
Dim tmp As String = ""
For l = 1 To inputString.Length
tmp = Mid(inputString, l, 1)
If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
kwds(k) &= tmp
Next
sw.Stop()
Dim fsTime As Long = sw.ElapsedTicks
sw.Start()
Dim strings() As String = inputString.Split(separator)
sw.Stop()
Debug.Print("FastSplit took " + fsTime.ToString + " whereas split took " + sw.ElapsedTicks.ToString)
Return kwds
End Function
Here are some results on relatively small strings but with varying sizes, up to 8kb blocks. (times are in ticks)
FastSplit took 8 whereas split took 10
FastSplit took 214 whereas split took 216
FastSplit took 10 whereas split took 12
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 12
FastSplit took 7 whereas split took 9
FastSplit took 6 whereas split took 8
FastSplit took 5 whereas split took 7
FastSplit took 10 whereas split took 13
FastSplit took 9 whereas split took 232
FastSplit took 7 whereas split took 8
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 215 whereas split took 217
FastSplit took 10 whereas split took 231
FastSplit took 8 whereas split took 10
FastSplit took 8 whereas split took 10
FastSplit took 7 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 1405
FastSplit took 9 whereas split took 11
FastSplit took 8 whereas split took 10
Also, I know someone will discourage my use of ReDim Preserve instead of using a list... The reason is, the list really didn't provide any speed difference in my benchmarks so I went back to the "simple" way.
public static unsafe List<string> SplitString(char separator, string input)
{
List<string> result = new List<string>();
int i = 0;
fixed(char* buffer = input)
{
for (int j = 0; j < input.Length; j++)
{
if (buffer[j] == separator)
{
buffer[i] = (char)0;
result.Add(new String(buffer));
i = 0;
}
else
{
buffer[i] = buffer[j];
i++;
}
}
buffer[i] = (char)0;
result.Add(new String(buffer));
}
return result;
}
You can assume that String.Split will be close to optimal; i.e. it could be quite hard to improve on it. By far the easier solution is to check whether you need to split the string at all. It's quite likely that you'll be using the individual strings directly. If you define a StringShim class (reference to String, begin & end index) you'll be able to split a String into a set of shims instead. These will have a small, fixed size, and will not cause string data copies.
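A minimal sketch of what such a shim might look like (the type is hypothetical, not part of the BCL):

// Each shim only records where a field starts and ends inside the original line;
// no character data is copied until ToString() is actually called.
struct StringShim
{
    private readonly string _source;
    public readonly int Start;
    public readonly int Length;

    public StringShim(string source, int start, int length)
    {
        _source = source;
        Start = start;
        Length = length;
    }

    public override string ToString() => _source.Substring(Start, Length);
}

On newer runtimes, string.AsMemory(start, length) returning ReadOnlyMemory<char> gives you essentially the same non-copying slice out of the box.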
String.split is rather slow; if you want some faster methods, here you go. :)
However, CSV is much better parsed by a rule-based parser.
This guy has made a rule-based tokenizer for Java (it requires some copy and pasting, unfortunately):
http://www.csdgn.org/code/rule-tokenizer
private static final String[] fSplit(String src, char delim) {
ArrayList<String> output = new ArrayList<String>();
int index = 0;
int lindex = 0;
while((index = src.indexOf(delim,lindex)) != -1) {
output.add(src.substring(lindex,index));
lindex = index+1;
}
output.add(src.substring(lindex));
return output.toArray(new String[output.size()]);
}
private static final String[] fSplit(String src, String delim) {
ArrayList<String> output = new ArrayList<String>();
int index = 0;
int lindex = 0;
while((index = src.indexOf(delim,lindex)) != -1) {
output.add(src.substring(lindex,index));
lindex = index+delim.length();
}
output.add(src.substring(lindex));
return output.toArray(new String[output.size()]);
}
How many parameters can you pass to a string.Format() method?
There must be some sort of theoretical or enforced limit on it. Is it based on the limits of the params[] type or the memory usage of the app that is using it or something else entirely?
OK, I emerge from hiding... I used the following program to verify what was going on, and while Marc pointed out that a string like "{0}{1}{2}...{2147483647}" would exceed the 2 GiB memory limit before the argument list would, my findings didn't match yours. Thus the hard limit on the number of parameters you can put in a string.Format method call has to be 107713904.
int i = 0;
long sum = 0;
while (sum < int.MaxValue)
{
var s = sizeof(char) * ("{" + i + "}").Length;
sum += s; // pseudo append
++i;
}
Console.WriteLine(i);
Console.ReadLine();
Love the discussion people!
Not as far as I know...
well, the theoretical limit would be the int32 limit for the array, but you'd hit the string length limit long before that, I guess...
Just don't go mad with it ;-p It may be better to write lots of small fragments to (for example) a file or response, than one huge hit.
edit - it looked like there was a limit in the IL (0xf4240), but apparently this isn't quite as it appears; I can make it get quite large (2^24) before I simply run out of system memory...
Update; it seems to me that the bounding point is the format string... those {1000001}{1000002} add up... a quick bit of math (below) shows that the maximum useful number of arguments we can use is 206,449,129:
long remaining = 2147483647;// max theoretical format arg length
long count = 10; // i.e. {0}-{9}
long len = 1;
int total = 0;
while (remaining >= 0) {
for(int i = 0 ; i < count && remaining >= 0; i++) {
total++;
remaining -= len + 2; // allow for {}
}
count *= 10;
len++;
}
Console.WriteLine(total - 1);
Expanding on Marc's detailed answer.
The only other limitation that is important is for the debugger. Once you pass a certain number of parameters directly to a function, the debugger becomes less functional in that method. I believe the limit is 64 parameters.
Note: This does not mean an array with 64 members, but 64 parameters passed directly to the function.
You might laugh and say "who would do this?", which is certainly a valid question. Yet LINQ makes this a lot easier than you think. Under the hood, LINQ makes the compiler generate a lot of code. It's possible that for a large generated SQL query where more than 64 fields are selected, you would hit this issue, because the compiler would need to pass all of the fields to the constructor of an anonymous type.
Still a corner case.
Considering that both the limit of the Array class and the String class are the upper limit of Int32 (documented as 2,147,483,647 here: Int32 Structure), it is reasonable to believe that this value is the limit on the number of string format parameters.
Update: Upon checking Reflector, John is right. String.Format, viewed in Red Gate Reflector, shows the following:
public static string Format(IFormatProvider provider, string format, params object[] args)
{
if ((format == null) || (args == null))
{
throw new ArgumentNullException((format == null) ? "format" : "args");
}
StringBuilder builder = new StringBuilder(format.Length + (args.Length * 8));
builder.AppendFormat(provider, format, args);
return builder.ToString();
}
The format.Length + (args.Length * 8) part of the code is enough to kill most of that number. Ergo, '2,147,483,647 = x + 8x' leaves us with x = 238,609,294 (theoretical).
It's far less than that of course; as the guys in the comments mentioned, the string hitting its length limit earlier is quite likely.
Maybe someone should just code this into a machine problem! :P