string.Format() parameters - c#

How many parameters can you pass to a string.Format() method?
There must be some sort of theoretical or enforced limit on it. Is it based on the limits of the params[] type or the memory usage of the app that is using it or something else entirely?

OK, I emerge from hiding... I used the following program to verify what was going on, and while Marc pointed out that a string like "{0}{1}{2}...{2147483647}" would exceed the 2 GiB memory limit before the argument list did, my findings didn't match yours. Thus the hard limit on the number of parameters you can put in a string.Format method call has to be 107,713,904.
int i = 0;
long sum = 0;
while (sum < int.MaxValue)
{
    var s = sizeof(char) * ("{" + i + "}").Length;
    sum += s; // pseudo append
    ++i;
}
Console.WriteLine(i);
Console.ReadLine();
Love the discussion people!

Not as far as I know...
well, the theoretical limit would be the int32 limit for the array, but you'd hit the string length limit long before that, I guess...
Just don't go mad with it ;-p It may be better to write lots of small fragments to (for example) a file or response, than one huge hit.
edit - it looked like there was a limit in the IL (0xf4240), but apparently this isn't quite as it appears; I can make it get quite large (2^24) before I simply run out of system memory...
Update; it seems to me that the bounding point is the format string... those {1000001}{1000002} add up... a quick bit of math (below) shows that the maximum useful number of arguments we can use is 206,449,129:
long remaining = 2147483647; // max theoretical format arg length
long count = 10; // i.e. {0}-{9}
long len = 1;
int total = 0;
while (remaining >= 0) {
    for (int i = 0; i < count && remaining >= 0; i++) {
        total++;
        remaining -= len + 2; // allow for {}
    }
    count *= 10;
    len++;
}
Console.WriteLine(total - 1);

Expanding on Marc's detailed answer.
The only other limitation that is important is for the debugger. Once you pass a certain number of parameters directly to a function, the debugger becomes less functional in that method. I believe the limit is 64 parameters.
Note: This does not mean an array with 64 members, but 64 parameters passed directly to the function.
You might laugh and say "who would do this?", which is certainly a valid question. Yet LINQ makes this a lot easier than you think. Under the hood, the compiler generates a lot of code for a LINQ query. It's possible that for a large generated SQL query where more than 64 fields are selected you would hit this issue, because the compiler needs to pass all of the selected fields to the constructor of an anonymous type.
Still a corner case.
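To make that concrete, here is a small hedged illustration (the Customer type and field names are invented for the example): each member in an anonymous-type projection becomes one parameter of the compiler-generated constructor, so a projection selecting 64+ columns produces a 64+ parameter constructor behind the scenes.
using System;
using System.Linq;

class Customer { public int Id; public string FirstName; public string LastName; }

static class AnonymousTypeDemo
{
    static void Main()
    {
        var customers = new[] { new Customer { Id = 1, FirstName = "Ada", LastName = "Lovelace" } };

        // The compiler turns `new { c.Id, c.FirstName, c.LastName }` into a generated
        // type whose constructor takes one argument per selected member; scale this up
        // to 64+ selected columns and you hit the debugger limit described above.
        var projected = customers.Select(c => new { c.Id, c.FirstName, c.LastName }).ToList();

        Console.WriteLine(projected[0]); // { Id = 1, FirstName = Ada, LastName = Lovelace }
    }
}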

Considering that the limits of both the Array class and the String class are the upper limit of Int32 (documented at 2,147,483,647 here: Int32 Structure), it is reasonable to believe that this value is the limit on the number of string format parameters.
Update: Upon checking Reflector, John is right. String.Format, viewed in Red Gate's Reflector, shows the following:
public static string Format(IFormatProvider provider, string format, params object[] args)
{
    if ((format == null) || (args == null))
    {
        throw new ArgumentNullException((format == null) ? "format" : "args");
    }
    StringBuilder builder = new StringBuilder(format.Length + (args.Length * 8));
    builder.AppendFormat(provider, format, args);
    return builder.ToString();
}
The format.Length + (args.Length * 8) part of the code is enough to kill most of that number. Ergo, '2,147,483,647 = x + 8x' leaves us with x = 238,609,294 (theoretical).
It's far less than that, of course; as the guys in the comments mentioned, hitting the string length limit first is quite likely.
Maybe someone should just code this into a machine problem! :P
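As a quick sanity check of that division, a purely illustrative one-liner:
// 2,147,483,647 = x + 8x  =>  x = int.MaxValue / 9
Console.WriteLine(int.MaxValue / 9); // prints 238609294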

Related

Reducing a BigInteger value in C#

I'm somewhat new to working with BigIntegers and have tried some stuff to get this system working, but feel a little stuck at the moment and would really appreciate a nudge in the right direction or a solution.
I'm currently working on a system which reduces BigInteger values down to a more readable form, and this is working fine with my current implementation, but I would like to further expand on it to get decimals implemented.
To better give a picture of what I'm attempting, I'll break it down.
In this context, we have a method which is taking a BigInteger, and returning it as a string:
public static string ShortenBigInt (BigInteger moneyValue)
With this in mind, when a number such as 10,000 is passed to this method, 10k will be returned. Same for 1,000,000 which will return 1M.
This is done by doing:
for (int i = 0; i < prefixes.Length; i++)
{
    if (!(moneyValue >= BigInteger.Pow(10, 3 * i)))
    {
        moneyValue = moneyValue / BigInteger.Pow(10, 3 * (i - 1));
        return moneyValue + prefixes[i - 1];
    }
}
This works by grabbing a string from an array of prefixes, reducing the number down to its simplest form, combining the two, and returning the result when the value falls inside that prefix range.
So with that context, the question I have is:
How might I go about returning this in the same way, where passing 100,000 would return 100k, but also doing something like 1,111,111 would return 1.11M?
Currently, passing 1,111,111 returns 1M, but I would like that additional .11 tagged on, with no more than 2 decimals.
My original thought was to convert the big integer into a string, then chunk out the first few characters into a new string and parse a decimal in there, but since prefixes don't change until values reach their 1000th mark, it's harder to tell when to place the decimal place.
My next thought was using BigInteger.Log to reduce the value down into a decimal friendly number and do a simple division to get the value in its decimal form, but doing this didn't seem to work with my implementation.
This system should work for the following prefixes, dynamically:
k, M, B, T, qd, Qn, sx, Sp,
O, N, de, Ud, DD, tdD, qdD, QnD,
sxD, SpD, OcD, NvD, Vgn, UVg, DVg,
TVg, qtV, QnV, SeV, SPG, OVG, NVG,
TGN, UTG, DTG, tsTG, qtTG, QnTG, ssTG,
SpTG, OcTG, NoTG, QdDR, uQDR, dQDR, tQDR,
qdQDR, QnQDR, sxQDR, SpQDR, OQDDr, NQDDr,
qQGNT, uQGNT, dQGNT, tQGNT, qdQGNT, QnQGNT,
sxQGNT, SpQGNT, OQQGNT, NQQGNT, SXGNTL
Would anyone happen to know how to do something like this? Any language is fine, C# is preferable, but I'm all good with translating. Thank you in advance!
Formatting it manually could work a bit like this (with prefixes as a string, which is effectively a char[]):
public static string ShortenBigInt(BigInteger moneyValue)
{
    string prefixes = " kMGTP";
    double m2 = (double)moneyValue;
    for (int i = 1; i < prefixes.Length; i++)
    {
        var step = Math.Pow(10, 3 * i);
        if (m2 / step < 1000)
        {
            return String.Format("{0:F2}", (m2 / step)) + prefixes[i];
        }
    }
    return "err";
}
Although Falco's answer does work, it doesn't do quite what was requested. This is the solution I was looking for; I received some help from a friend on it. It keeps going until there are no more prefixes left in your string array of prefixes, and if you do run out of bounds, the resulting exception is caught and handled by returning "Infinity".
This solution is better because the value is never crunched down to doubles/decimals in the process. It has no numeric cap; the only limit is the number of prefixes you provide.
public static string ShortenBigInt(BigInteger moneyValue)
{
    if (moneyValue < 1000)
        return "" + moneyValue;
    try
    {
        string moneyAsString = moneyValue.ToString();
        string prefix = prefixes[(moneyAsString.Length - 1) / 3];
        BigInteger chopAmmount = (moneyAsString.Length - 1) % 3 + 1;
        int insertPoint = (int)chopAmmount;
        chopAmmount += 2;
        moneyAsString = moneyAsString.Remove(Math.Min(moneyAsString.Length - 1, (int)chopAmmount));
        moneyAsString = moneyAsString.Insert(insertPoint, ".");
        return moneyAsString + " " + prefix;
    }
    catch (Exception exceptionToBeThrown)
    {
        return "Infinity";
    }
}
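For reference, a hedged usage sketch: the answer doesn't show how the prefixes field is declared, so the declaration below is an assumption (index 0 is never used, because values under 1000 return early).
// Assumed field; not shown in the original answer. Requires using System.Numerics;
static readonly string[] prefixes = { "", "k", "M", "B", "T" /* , ... */ };

// With that assumption, 1,111,111 comes out as "1.11 M".
Console.WriteLine(ShortenBigInt(new BigInteger(1111111)));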

System.DivideByZeroException on high numbers in C#

So I've been trying to write a program in C# that returns all factors for a given number (I will implement user input later). The program looks as follows:
// Number to divide
long num = 600851475143;

// Initializes list
List<long> list = new List<long>();

// Defines combined variable for later output
var combined = string.Join(", ", list);

for (int i = 1; i < num; i++)
{
    if (num % i == 0)
    {
        list.Add(i);
        Console.WriteLine(i);
    }
}
However, after some time the program starts to also try to divide negative numbers, which eventually ends in a System.DivideByZeroException. It's not clear to me why it does this. It only starts to do this once the "num" variable contains a number with 11 digits or more. But since I need such a high number, a fix or similar would be highly appreciated. I am still a beginner.
Thank you!
I strongly suspect the problem is integer overflow. num is a 64-bit integer, whereas i is a 32-bit integer. If num is more than int.MaxValue, then as you increment i it will end up overflowing back to negative values and then eventually 0... at which point num % i will throw.
The simplest option is just to change i to be a long instead:
for (long i = 1; i < num; i++)
It's unfortunate that there'd be no warning in your original code - i is promoted to long where it needs to be, because there's an implicit conversion from int to long. It's not obvious to me what would need to change for this to be spotted in the language itself. It would be simpler for a Roslyn analyzer to notice this sort of problem.
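A minimal sketch of the wrap-around Jon describes, purely for illustration:
// In the default unchecked context, int arithmetic silently wraps.
int i = int.MaxValue;
i++;                      // now int.MinValue (-2147483648)
Console.WriteLine(i);

// Keep incrementing and i eventually reaches 0 again, at which
// point `num % i` throws System.DivideByZeroException.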

How can you represent binary numbers on a label?

I converted a decimal number to binary, but I don't know how to display it on a label. I have a list of 0s and 1s; how do I display that information on a Label control?
private void btnRun_Click(object sender, EventArgs e)
{
    var decimaltoBinary = fnDecimalToBinary(Convert.ToInt32(txtenterNumber.Text));
}

private List<int> fnDecimalToBinary(int number)
{
    int[] decimalNumbers = new int[] { 1, 2, 4, 8, 16, 32, 64, 128, 256 };
    List<int> binaryNumbers = new List<int>();
    int locDecimalArray = 0;
    int sumNumber = 0;
    for (int i = 0; i < decimalNumbers.Length; i++)
    {
        if (number < decimalNumbers[i])
        {
            sumNumber = number;
            locDecimalArray = i - 1;
            for (int j = locDecimalArray; j >= 0; j--)
            {
                if (sumNumber == 0)
                {
                    binaryNumbers.Add(0);
                    return binaryNumbers;
                }
                else if (sumNumber >= decimalNumbers[j])
                {
                    sumNumber = sumNumber - decimalNumbers[j];
                    binaryNumbers.Add(1);
                }
                else if (sumNumber < decimalNumbers[j])
                {
                    binaryNumbers.Add(0);
                }
            }
            return binaryNumbers;
        }
    }
    return binaryNumbers;
}
It seems that you've received a comment that explains how you can convert your List<int> to the string value you need for the Label control. However, it seems to me that for the purposes of this exercise, you might benefit from some help with the decimal-to-binary conversion itself. There are already a number of similar questions on Stack Overflow dealing with this scenario (as you can guess, converting to binary text is a fairly common programming exercise), but of course none will start with your specific code, so I think it's worth writing yet another answer. :)
Basing the conversion on a pre-computed list of numeric values is not a terrible way to go, especially for the purposes of learning. But your version has a bunch of extra code that's just not necessary:
Your outer loop doesn't accomplish anything except verify that the number passed is within the range permitted by your pre-computed values. But this can be done as part of the conversion itself.
Furthermore, I'm not convinced that returning an empty list is really the best way to deal with invalid input. Throwing an exception would be more appropriate, as this forces the caller to deal with errors, and allows you to provide a textual message for display to the user.
The value 0 is always less than any of the digit values you've pre-computed, so there's no need to check for that explicitly. You really only need the if and a single else inside the inner loop.
Since you are the one populating the array, and since for loops are generally more readable when they start at 0 and increment the index, as opposed to starting at the end and decrementing, it seems to me that you would be better off writing the pre-computed values in reverse.
Entering numbers by hand is a pain and it seems to me that the method could be more flexible (i.e. support larger binary numbers) if you allowed the caller to pass the number of digits to produce, and used that to compute the values at run-time (though, if for performance reasons that's less desirable, pre-computing the largest digits that would be used and storing that in a static field, and then just using whatever subset of that you need, would be yet another suitable approach).
With those changes, you would get something like this:
private List<int> DecimalToBinary(int number, int digitCount)
{
    // The number can't itself have more than 32 digits, so there's
    // no point in allowing the caller to ask for more than that.
    if (digitCount < 1 || digitCount > 32)
    {
        throw new ArgumentOutOfRangeException("digitCount",
            "digitCount must be between 1 and 32, inclusive");
    }

    long[] digitValues = Enumerable.Range(0, digitCount)
        .Select(i => (long)Math.Pow(2, digitCount - i - 1)).ToArray();
    List<int> binaryDigits = new List<int>(digitCount);

    for (int i = 0; i < digitValues.Length; i++)
    {
        if (digitValues[i] <= number)
        {
            binaryDigits.Add(1);
            number = (int)(number - digitValues[i]);
        }
        else
        {
            binaryDigits.Add(0);
        }
    }

    if (number > 0)
    {
        throw new ArgumentOutOfRangeException("digitCount",
            "digitCount was not large enough to accommodate the number");
    }

    return binaryDigits;
}
And here's an example of how you might use it:
private void button1_Click(object sender, EventArgs e)
{
    int number;

    if (!int.TryParse(textBox1.Text, out number))
    {
        MessageBox.Show("Could not convert user input to an int value");
        return;
    }

    try
    {
        List<int> binaryDigits = DecimalToBinary(number, 8);
        label3.Text = string.Join("", binaryDigits);
    }
    catch (ArgumentOutOfRangeException e1)
    {
        MessageBox.Show("Exception: " + e1.Message, "Could not convert to binary");
    }
}
Now, the above example fits the design you originally had, just cleaned it up a bit. But the fact is, the computer already knows binary. That's how it stores numbers, and even if it didn't, C# includes operators that treat the numbers as binary (so if the computer didn't use binary, the run-time would be required to translate for you anyway). Given that, it's actually a lot easier to convert just by looking at the individual bits. For example:
private List<int> DecimalToBinary2(int number, int digitCount)
{
    if (digitCount < 1 || digitCount > 32)
    {
        throw new ArgumentOutOfRangeException("digitCount",
            "digitCount must be between 1 and 32, inclusive");
    }

    if (number > Math.Pow(2, digitCount) - 1)
    {
        throw new ArgumentOutOfRangeException("digitCount",
            "digitCount was not large enough to accommodate the number");
    }

    List<int> binaryDigits = new List<int>(digitCount);

    for (int i = digitCount - 1; i >= 0; i--)
    {
        binaryDigits.Add((number & (1 << i)) != 0 ? 1 : 0);
    }

    return binaryDigits;
}
The above simply starts at the highest possible binary digit (given the desired count of digits), and checks each individual digit in the provided number, using the "bit-shift" operator << and the logical bitwise "and" operator &. If you're not already familiar with binary arithmetic, shift operations, and these operators, this might seem like overkill. But it's actually a fundamental aspect of how computers work, worth knowing, and of course as shown above, can dramatically simplify code required to deal with binary data (to the point where parameter validation code takes up half the method :) ).
One last thing: this entire discussion ignores the fact that you're using a signed int value, rather than the unsigned uint type. Technically, this means your code really ought to be able to handle negative numbers as well. However, doing so is a bit trickier when you also want to deal with binary digit counts that are less than the natural width of the number in the numeric type (e.g. 32 bits for an int). Conversely, if you don't want to support negative numbers, you should really be using the uint type instead of int.
I figured that trying to address that particular complication would dramatically increase the complexity of this answer and take away from the more fundamental details that seemed worth conveying. So I've left that out. But I do encourage you to look more deeply into how computers represent numbers, and why negative numbers require more careful handling than the above code is doing.
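If you're curious what a negative value's bit pattern looks like, the framework can show you the two's complement form directly. A quick aside, not part of the exercise above:
Console.WriteLine(Convert.ToString(5, 2));   // 101
Console.WriteLine(Convert.ToString(-5, 2));  // 11111111111111111111111111111011 (two's complement)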

Ascii Bytes Array To Int32 or Double

I'm re-writing a library with a mandate to make it totally allocation free. The goal is to have zero collections after the app's startup phase is done.
Previously, there were a lot of calls like this:
Int32 foo = Int32.Parse(ASCIIEncoding.ASCII.GetString(bytes, start, length));
Which I believe is allocating a string. I couldn't find a C# library function that would do the same thing automatically. I looked at the BitConverter class, but it looks like that is only if your Int32 is encoded with the actual bytes that represent it. Here, I have an array of bytes representing Ascii characters that represent an Int32.
Here's what I did
public static Int32 AsciiBytesToInt32(byte[] bytes, int start, int length)
{
    Int32 Temp = 0;
    Int32 Result = 0;
    Int32 j = 1;
    for (int i = start + length - 1; i >= start; i--)
    {
        Temp = ((Int32)bytes[i]) - 48;
        if (Temp < 0 || Temp > 9)
        {
            throw new Exception("Bytes In AsciiBytesToInt32 Are Not An Int32");
        }
        Result += Temp * j;
        j *= 10;
    }
    return Result;
}
Does anyone know of a C# library function that already does this in a more optimal way? Or an improvement to make the above run faster (it's probably going to be called millions of times a day)? Thanks!
Millions of times per day shouldn't be a problem - I'd expect that to be able to run hundreds of thousands of times per second. Personally I'd rewrite the above to only declare "temp" within the loop (and get rid of the Pascal-cased local variable names - urgh), but it should be okay.
The code would be more immediately understandable as:
int digit = bytes[i] - '0';
which does the same as your
Temp = ((Int32)bytes[i]) - 48;
line, but in a simpler way (IMO). They should behave exactly the same way.
On a general note, trying to write C# without any allocations is pretty harsh, and fights against the way the language and framework are designed. Do you believe this is actually a reasonable requirement? Admittedly I've heard about it being the way some games are written in managed code... but it does seem a bit odd.
Of course, you're going to allocate an exception if the bytes are inappropriate...
EDIT: Note that your code doesn't allow for negative numbers. Is that okay?
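If negative numbers do need to be supported, one hedged sketch is to wrap the method from the question and peel off an optional leading minus sign (AsciiBytesToInt32Signed is just an illustrative name):
public static int AsciiBytesToInt32Signed(byte[] bytes, int start, int length)
{
    // Assumes at most one optional leading '-'; digits are handled by the original method.
    // (The int.MinValue edge case is ignored in this sketch.)
    bool negative = length > 0 && bytes[start] == (byte)'-';
    int value = AsciiBytesToInt32(bytes,
                                  negative ? start + 1 : start,
                                  negative ? length - 1 : length);
    return negative ? -value : value;
}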

Does anyone know of a faster method to do String.Split()?

I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:
values = line.Split(delimiter);
where line is a string that holds the values separated by the delimiter.
Measuring the performance of my ReadNextRow method, I noticed that it spends 66% of its time in String.Split, so I was wondering if someone knows of a faster way to do this.
Thanks!
The BCL implementation of string.Split is actually quite fast; I've done some testing here trying to outperform it, and it's not easy.
But there's one thing you can do and that's to implement this as a generator:
public static IEnumerable<string> GetSplit( this string s, char c )
{
    int l = s.Length;
    int i = 0, j = s.IndexOf( c, 0, l );
    if ( j == -1 ) // No such substring
    {
        yield return s; // Return original and break
        yield break;
    }
    while ( j != -1 )
    {
        if ( j - i > 0 ) // Non empty?
        {
            yield return s.Substring( i, j - i ); // Return non-empty match
        }
        i = j + 1;
        j = s.IndexOf( c, i, l - i );
    }
    if ( i < l ) // Has remainder?
    {
        yield return s.Substring( i, l - i ); // Return remaining trail
    }
}
The above method is not necessarily faster than string.Split for small strings, but it returns results as it finds them; this is the power of lazy evaluation. If you have long lines or need to conserve memory, this is the way to go.
The above method is bounded by the performance of IndexOf and Substring, which do too much index-out-of-range checking; to be faster you need to optimize those away and implement your own helper methods. You can beat string.Split's performance, but it's going to take clever int-hacking. You can read my post about that here.
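Calling the extension method above might look like this (assuming `line` holds one delimited row):
// Fields are produced lazily; nothing is allocated for parts you never enumerate.
foreach (string field in line.GetSplit(','))
{
    Console.WriteLine(field);
}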
It should be pointed out that split() is a questionable approach for parsing CSV files in case you come across commas in the file eg:
1,"Something, with a comma",2,3
The other thing I'll point out, without knowing how you profiled, is to be careful about profiling this kind of low-level detail. The granularity of the Windows/PC timer might come into play, and you may have a significant overhead in just looping, so use some sort of control value.
That being said, split() is built to handle regular expressions, which are obviously more complex than you need (and the wrong tool to deal with escaped commas anyway). Also, split() creates lots of temporary objects.
So if you want to speed it up (and I have trouble believing that performance of this part is really an issue) then you want to do it by hand and you want to reuse your buffer objects so you're not constantly creating objects and giving the garbage collector work to do in cleaning them up.
The algorithm for that is relatively simple (a rough sketch follows the list):
Stop at every comma;
When you hit quotes continue until you hit the next set of quotes;
Handle escaped quotes (ie \") and arguably escaped commas (\,).
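Purely as a hedged illustration of those three rules (this is not the fast CSV reader linked below, and not a full RFC 4180 parser), a hand-rolled splitter might look like the following; a production version would also reuse the list and StringBuilder between calls, per the buffer-reuse advice above.
using System.Collections.Generic;
using System.Text;

static class HandRolledCsv
{
    // Splits one CSV line: commas delimit fields, double quotes group text,
    // and a backslash escapes the next character (e.g. \" or \,).
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '\\' && i + 1 < line.Length)
            {
                current.Append(line[++i]);      // escaped quote or comma
            }
            else if (c == '"')
            {
                inQuotes = !inQuotes;           // toggle quoted section
            }
            else if (c == ',' && !inQuotes)
            {
                fields.Add(current.ToString()); // field boundary
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());
        return fields;
    }
}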
Oh and to give you some idea of the cost of regex, there was a question (Java not C# but the principle was the same) where someone wanted to replace every n-th character with a string. I suggested using replaceAll() on String. Jon Skeet manually coded the loop. Out of curiosity I compared the two versions and his was an order of magnitude better.
So if you really want performance, it's time to hand parse.
Or, better yet, use someone else's optimized solution like this fast CSV reader.
By the way, while this is in relation to Java it concerns the performance of regular expressions in general (which is universal) and replaceAll() vs a hand-coded loop: Putting char into a java string for each N characters.
Here's a very basic example using ReadOnlySpan. On my machine this takes around 150ns as opposed to string.Split() which takes around 250ns. That's a nice 40% improvement right there.
string serialized = "1577836800;1000;1";
ReadOnlySpan<char> span = serialized.AsSpan();
Trade result = new Trade();

int index = span.IndexOf(';');
result.UnixTimestamp = long.Parse(span.Slice(0, index));
span = span.Slice(index + 1);

index = span.IndexOf(';');
result.Price = float.Parse(span.Slice(0, index));
span = span.Slice(index + 1);

// Last field: no trailing separator, so parse the remainder directly.
result.Quantity = float.Parse(span);

return result;
Note that a ReadOnlySpan.Split() will soon be part of the framework. See
https://github.com/dotnet/runtime/pull/295
Depending on use, you can speed this up by using Pattern.split instead of String.split. If you have this code in a loop (which I assume you probably do since it sounds like you are parsing lines from a file) String.split(String regex) will call Pattern.compile on your regex string every time that statement of the loop executes. To optimize this, Pattern.compile the pattern once outside the loop and then use Pattern.split, passing the line you want to split, inside the loop.
Hope this helps
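That advice is Java-specific (C#'s String.Split doesn't use regular expressions at all), but if you do need a regex-based split in C#, the analogous move is to build the Regex once outside the loop. A rough sketch; the class, field name, and delimiter here are illustrative:
using System.Text.RegularExpressions;

class RegexSplitExample
{
    // Compiled once, reused for every line read from the file.
    private static readonly Regex Delimiter = new Regex(",", RegexOptions.Compiled);

    static string[] SplitLine(string line)
    {
        return Delimiter.Split(line);
    }
}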
I found this implementation, which is 30% faster, on Dejan Pelzel's blog. I quote from there:
The Solution
With this in mind, I set out to create a string splitter that would use an internal buffer, similarly to a StringBuilder. It uses very simple logic: go through the string and save the value parts into the buffer as it goes along.
public int Split(string value, char separator)
{
    int resultIndex = 0;
    int startIndex = 0;

    // Find the mid-parts
    for (int i = 0; i < value.Length; i++)
    {
        if (value[i] == separator)
        {
            this.buffer[resultIndex] = value.Substring(startIndex, i - startIndex);
            resultIndex++;
            startIndex = i + 1;
        }
    }

    // Find the last part
    this.buffer[resultIndex] = value.Substring(startIndex, value.Length - startIndex);
    resultIndex++;

    return resultIndex;
}
How To Use
The StringSplitter class is incredibly simple to use, as you can see in the example below. Just be careful to reuse the StringSplitter object and not create a new instance of it in loops or for single-time use; in that case it would be better to just use the built-in String.Split.
var splitter = new StringSplitter(2);
splitter.Split("Hello World", ' ');

if (splitter.Results[0] == "Hello" && splitter.Results[1] == "World")
{
    Console.WriteLine("It works!");
}
The Split method returns the number of items found, so you can easily iterate through the results like this:
var splitter = new StringSplitter(2);
var len = splitter.Split("Hello World", ' ');

for (int i = 0; i < len; i++)
{
    Console.WriteLine(splitter.Results[i]);
}
This approach has advantages and disadvantages.
You might think that there are optimizations to be had, but the reality will be you'll pay for them elsewhere.
You could, for example, do the split 'yourself' and walk through all the characters and process each column as you encounter it, but you'd be copying all the parts of the string in the long run anyhow.
One of the optimizations we could do in C or C++, for example, is replace all the delimiters with '\0' characters, and keep pointers to the start of the column. Then, we wouldn't have to copy all of the string data just to get to a part of it. But this you can't do in C#, nor would you want to.
If there is a big difference between the number of columns that are in the source, and the number of columns that you need, walking the string manually may yield some benefit. But that benefit would cost you the time to develop it and maintain it.
I've been told that 90% of the CPU time is spent in 10% of the code. There are variations to this "truth". In my opinion, spending 66% of your time in Split is not that bad if processing CSV is the thing that your app needs to do.
Dave
Some very thorough analysis on String.Split() vs Regex and other methods.
We are talking ms savings over very large strings though.
The main problem(?) with String.Split is that it's general, in that it caters for many needs.
If you know more about your data than Split does, rolling your own can be an improvement.
For instance, if:
You don't care about empty strings, so you don't need to handle those any special way
You don't need to trim strings, so you don't need to do anything with or around those
You don't need to check for quoted commas or quotes
You don't need to handle quotes at all
If any of these are true, you might see an improvement by writing your own more specific version of String.Split.
Having said that, the first question you should ask is whether this actually is a problem worth solving. Is the time taken to read and import the file so long that you actually feel this is a good use of your time? If not, then I would leave it alone.
The second question is why String.Split is using that much time compared to the rest of your code. If the answer is that the code is doing very little with the data, then I would probably not bother.
However, if, say, you're stuffing the data into a database, then 66% of the time of your code spent in String.Split constitutes a big big problem.
CSV parsing is actually fiendishly complex to get right; the one and only time I had to do this, I used classes based on wrapping the ODBC Text driver.
The ODBC solution recommended above looks at first glance to be basically the same approach.
I thoroughly recommend you do some research on CSV parsing before you get too far down a path that nearly-but-not-quite works (all too common). The Excel thing of only double-quoting strings that need it is one of the trickiest to deal with in my experience.
As others have said, String.Split() will not always work well with CSV files. Consider a file that looks like this:
"First Name","Last Name","Address","Town","Postcode"
David,O'Leary,"12 Acacia Avenue",London,NW5 3DF
June,Robinson,"14, Abbey Court","Putney",SW6 4FG
Greg,Hampton,"",,
Stephen,James,"""Dunroamin"" 45 Bridge Street",Bristol,BS2 6TG
(e.g. inconsistent use of speechmarks, strings including commas and speechmarks, etc)
This CSV reading framework will deal with all of that, and is also very efficient:
LumenWorks.Framework.IO.Csv by Sebastien Lorien
This is my solution:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
    Dim kwds(1) As String
    Dim k = 0
    Dim tmp As String = ""
    For l = 1 To inputString.Length ' Mid is 1-based, so run to Length to include the last character
        tmp = Mid(inputString, l, 1)
        If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
        kwds(k) &= tmp
    Next
    Return kwds
End Function
Here is a version with benchmarking:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
    Dim sw As New Stopwatch
    sw.Start()
    Dim kwds(1) As String
    Dim k = 0
    Dim tmp As String = ""
    For l = 1 To inputString.Length ' Mid is 1-based, so run to Length to include the last character
        tmp = Mid(inputString, l, 1)
        If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
        kwds(k) &= tmp
    Next
    sw.Stop()
    Dim fsTime As Long = sw.ElapsedTicks
    sw.Start()
    Dim strings() As String = inputString.Split(separator)
    sw.Stop()
    Debug.Print("FastSplit took " + fsTime.ToString + " whereas split took " + sw.ElapsedTicks.ToString)
    Return kwds
End Function
Here are some results on relatively small strings but with varying sizes, up to 8kb blocks. (times are in ticks)
FastSplit took 8 whereas split took 10
FastSplit took 214 whereas split took 216
FastSplit took 10 whereas split took 12
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 12
FastSplit took 7 whereas split took 9
FastSplit took 6 whereas split took 8
FastSplit took 5 whereas split took 7
FastSplit took 10 whereas split took 13
FastSplit took 9 whereas split took 232
FastSplit took 7 whereas split took 8
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 215 whereas split took 217
FastSplit took 10 whereas split took 231
FastSplit took 8 whereas split took 10
FastSplit took 8 whereas split took 10
FastSplit took 7 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 1405
FastSplit took 9 whereas split took 11
FastSplit took 8 whereas split took 10
Also, I know someone will discourage my use of ReDim Preserve instead of using a list... The reason is, the list really didn't provide any speed difference in my benchmarks so I went back to the "simple" way.
public static unsafe List<string> SplitString(char separator, string input)
{
    List<string> result = new List<string>();
    int i = 0;

    fixed (char* buffer = input)
    {
        for (int j = 0; j < input.Length; j++)
        {
            if (buffer[j] == separator)
            {
                buffer[i] = (char)0;
                result.Add(new String(buffer));
                i = 0;
            }
            else
            {
                buffer[i] = buffer[j];
                i++;
            }
        }
        buffer[i] = (char)0;
        result.Add(new String(buffer));
    }

    return result;
}
You can assume that String.Split will be close to optimal; i.e. it could be quite hard to improve on it. By far the easier solution is to check whether you need to split the string at all. It's quite likely that you'll be using the individual strings directly. If you define a StringShim class (reference to String, begin & end index) you'll be able to split a String into a set of shims instead. These will have a small, fixed size, and will not cause string data copies.
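A minimal sketch of what such a StringShim might look like (the type doesn't exist in the framework; the names and layout here are assumptions). On modern runtimes, ReadOnlySpan&lt;char&gt; / ReadOnlyMemory&lt;char&gt; give you the same zero-copy view out of the box.
// A lightweight view over part of an existing string; no character data is copied.
public readonly struct StringShim
{
    private readonly string _source;
    private readonly int _start;

    public int Length { get; }

    public StringShim(string source, int start, int length)
    {
        _source = source;
        _start = start;
        Length = length;
    }

    public char this[int index] => _source[_start + index];

    // Only materialise a real substring when one is genuinely needed.
    public override string ToString() => _source.Substring(_start, Length);
}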
String.split is rather slow; if you want some faster methods, here you go. :)
However, CSV is much better parsed by a rule-based parser.
This guy has made a rule-based tokenizer for Java (requires some copy and pasting, unfortunately):
http://www.csdgn.org/code/rule-tokenizer
private static final String[] fSplit(String src, char delim) {
    ArrayList<String> output = new ArrayList<String>();
    int index = 0;
    int lindex = 0;
    while ((index = src.indexOf(delim, lindex)) != -1) {
        output.add(src.substring(lindex, index));
        lindex = index + 1;
    }
    output.add(src.substring(lindex));
    return output.toArray(new String[output.size()]);
}

private static final String[] fSplit(String src, String delim) {
    ArrayList<String> output = new ArrayList<String>();
    int index = 0;
    int lindex = 0;
    while ((index = src.indexOf(delim, lindex)) != -1) {
        output.add(src.substring(lindex, index));
        lindex = index + delim.length();
    }
    output.add(src.substring(lindex));
    return output.toArray(new String[output.size()]);
}
