How to split a string in an efficient way in C#

I have a string like this:
-82.9494547,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0
I need to have output like this:
-82.9494547 36.2913021, -83.0784938 36.2347521, -82.9537782 36.079235
I have tried the following code to achieve the desired output:
string[] coordinatesVal = coordinateTxt.Trim().Split(new string[] { ",0" }, StringSplitOptions.None);
// The last element is the empty string after the final ",0", so it is skipped.
for (int i = 0; i < coordinatesVal.Length - 1; i++)
{
    coordinatesVal[i] = coordinatesVal[i].Trim();
    coordinatesVal[i] = coordinatesVal[i].Replace(',', ' ');
    numbers.Append(coordinatesVal[i]);
    // Add a separator after every pair except the last one
    if (i != coordinatesVal.Length - 2)
    {
        numbers.Append(", ");
    }
}
But this does not seem like a professional solution to me. Can anyone please suggest a more efficient way of doing this?

Your code is okay. You could drop the temporary results and chain the method calls:
var numbers = new StringBuilder();
string[] coordinatesVal = coordinateTxt
    .Trim()
    .Split(new string[] { ",0" }, StringSplitOptions.None);
for (int i = 0; i < coordinatesVal.Length - 1; i++) {
    numbers
        .Append(coordinatesVal[i].Trim().Replace(',', ' '))
        .Append(", ");
}
numbers.Length -= 2;
Note that the last statement assumes that there is at least one coordinate pair available. If the coordinates can be empty, you would have to enclose the loop and this last statement in if (coordinatesVal.Length > 1) { ... } (Split returns at least one element even for empty input, so testing for > 0 would not help). This is still more efficient than having an if inside the loop.

You ask about efficiency, but you don't specify whether you mean code efficiency (execution speed) or programmer efficiency (how much time you have to spend on it).
One key part of professional programming is to judge which one of these is more important in any given situation.
The other answers do a good job of covering programmer efficiency, so I'm taking a stab at code efficiency. I'm doing this at home for fun, but for professional work I would need a good reason before putting in the effort to even spend time comparing the speeds of the methods given in the other answers, let alone try to improve on them.
Having said that, waiting around for the program to finish doing the conversion of millions of coordinate pairs would give me such a reason.
One of the speed pitfalls of C# string handling is the way String.Replace() and String.Trim() return a whole new copy of the string. This involves allocating memory, copying the characters, and eventually cleaning up the garbage generated. Do that a few million times, and it starts to add up. With that in mind, I attempted to avoid as many allocations and copies as possible.
enum CurrentField
{
    FirstNum,
    SecondNum,
    UnwantedZero
};
static string ConvertStateMachine(string input)
{
    // Pre-allocate enough space in the string builder.
    var numbers = new StringBuilder(input.Length);
    var state = CurrentField.FirstNum;
    int i = 0;
    while (i < input.Length)
    {
        char c = input[i++];
        switch (state)
        {
            // Copying the first number to the output, next will be another number
            case CurrentField.FirstNum:
                if (c == ',')
                {
                    // Separate the two numbers by space instead of comma, then move on
                    numbers.Append(' ');
                    state = CurrentField.SecondNum;
                }
                else if (!(c == ' ' || c == '\n'))
                {
                    // Ignore whitespace, output anything else
                    numbers.Append(c);
                }
                break;
            // Copying the second number to the output, next will be the ,0\n that we don't need
            case CurrentField.SecondNum:
                if (c == ',')
                {
                    numbers.Append(", ");
                    state = CurrentField.UnwantedZero;
                }
                else if (!(c == ' ' || c == '\n'))
                {
                    // Ignore whitespace, output anything else
                    numbers.Append(c);
                }
                break;
            case CurrentField.UnwantedZero:
                // Output nothing, just track when the line is finished and we start all over again.
                if (c == '\n')
                {
                    state = CurrentField.FirstNum;
                }
                break;
        }
    }
    // Every pair is followed by ", ", so drop the separator left after the last pair.
    if (numbers.Length >= 2 && numbers[numbers.Length - 2] == ',')
    {
        numbers.Length -= 2;
    }
    return numbers.ToString();
}
This uses a state machine to treat incoming characters differently depending on whether they are part of the first number, second number, or the rest of the line, and output characters accordingly. Each character is only copied once into the output, then I believe once more when the output is converted to a string at the end. This second conversion could probably be avoided by using a char[] for the output.
The bottleneck in this code seems to be the number of calls to StringBuilder.Append(). If more speed were required, I would first attempt to keep track of how many characters were to be copied directly into the output, then use .Append(string value, int startIndex, int count) to send an entire number across in one call.
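A rough sketch of that idea follows. It is not the benchmarked code: CoordinateConverter and ConvertByRuns are hypothetical names, it finds the commas with IndexOf rather than a state machine, and it assumes every line has the form first,second,0 with no spaces inside the line. Lines that do not fit that shape end the conversion early.

```csharp
using System.Text;

static class CoordinateConverter
{
    // Hypothetical variant: remembers where each number starts and copies the
    // whole run with StringBuilder.Append(string, int, int) instead of
    // appending one character at a time.
    public static string ConvertByRuns(string input)
    {
        var output = new StringBuilder(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            // Skip whitespace and line breaks in front of the first number.
            while (i < input.Length && char.IsWhiteSpace(input[i])) i++;
            if (i >= input.Length) break;

            // First number: everything up to the first comma.
            int firstComma = input.IndexOf(',', i);
            if (firstComma < 0) break;      // malformed tail, stop here

            // Second number: everything up to the next comma.
            int secondComma = input.IndexOf(',', firstComma + 1);
            if (secondComma < 0) break;     // malformed tail, stop here

            if (output.Length > 0) output.Append(", ");
            output.Append(input, i, firstComma - i);
            output.Append(' ');
            output.Append(input, firstComma + 1, secondComma - firstComma - 1);

            // Skip the unwanted ",0" and continue after the end of the line.
            int lineEnd = input.IndexOf('\n', secondComma);
            i = lineEnd < 0 ? input.Length : lineEnd + 1;
        }
        return output.ToString();
    }
}
```

Whether this is actually faster than the character-by-character version would have to be measured in the same test harness; no timing is claimed for it here.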
I put a few example solutions into a test harness, and ran them on a string containing 300,000 coordinate-pair lines, averaged over 50 runs. The results on my PC were:
String Split, Replace each line (see Olivier's answer, though I pre-allocated the space in the StringBuilder):
6542 ms / 13493147 ticks, 130.84ms / 269862.9 ticks per conversion
Replace & Trim entire string (see Heriberto's second version):
3352 ms / 6914604 ticks, 67.04 ms / 138292.1 ticks per conversion
- Note: Original test was done with 900000 coord pairs, but this entire-string version suffered an out of memory exception so I had to rein it in a bit.
Split and Join (see Łukasz's answer):
8780 ms / 18110672 ticks, 175.6 ms / 362213.4 ticks per conversion
Character state machine (see above):
1685 ms / 3475506 ticks, 33.7 ms / 69510.12 ticks per conversion
So, the question of which version is most efficient comes down to: what are your requirements?

Your solution is fine. Maybe you could write it a bit more elegantly, like this:
string[] coordinatesVal = coordinateTxt.Trim().Split(new string[] { ",0" },
    StringSplitOptions.RemoveEmptyEntries);
string result = string.Empty;
foreach (string line in coordinatesVal)
{
    string[] numbers = line.Trim().Split(',');
    result += numbers[0] + " " + numbers[1] + ", ";
}
// Drop the trailing ", " (only present if at least one pair was found)
if (result.Length >= 2)
{
    result = result.Remove(result.Length - 2);
}
Note the StringSplitOptions.RemoveEmptyEntries parameter of the Split method, so you don't have to deal with empty entries inside the foreach block.

Or you can use an extremely short one-liner. It is harder to debug, but in simple cases it does the job.
// Trim() removes the line break left at the start of each entry
string result =
    string.Join(", ",
        coordinateTxt.Trim().Split(new string[] { ",0" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(i => i.Trim().Replace(",", " ")));

Here's another way without defining your own loops and replace methods, or using LINQ.
string coordinateTxt = @" -82.9494547,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0";
string[] coordinatesVal = coordinateTxt.Replace(",", "*").Trim().Split(new string[] { "*0", Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(", ", coordinatesVal).Replace("*", " ");
Console.WriteLine(result);
or even
string coordinateTxt = @" -82.9494540,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0";
string result = coordinateTxt.Replace(Environment.NewLine, "").Replace(",", " ").Replace(" 0", ", ").Trim(new char[]{ ',',' ' });
Console.WriteLine(result);

Related

Optimize recursive function that generates a sequence of chars

I have a function that needs to generate a sequence of num chars to test a security algorithm.
For instance, patternLength of 4 it would generate:
["0000", "0001", "0002", ... , "9999"]
and 3 would generate:
["000", "001", ... "999"] and so on.
As we know, recursion can be pretty expensive and begins to slow to a crawl at higher lengths so I'm hoping to speed it up by using some caching or DP. Is this at all possible?
Current function with recursion:
private static List<char> PossibleCharacters = new List<char>()
{
    '0','1','2','3','4','5','6','7','8','9'
};
public static List<string> SequenceGenerator(int patternLength)
{
    List<string> result = new List<string>();
    if (patternLength > 0)
    {
        List<string> prev = SequenceGenerator(patternLength - 1);
        foreach (string entry in prev)
        {
            foreach (char ch in PossibleCharacters)
            {
                result.Add(entry + ch);
            }
        }
    }
    else
    {
        result.Add("");
    }
    return result;
}
My messy attempt. I'm building the list starting with length 1, then 2, and so on. It gets to 999 -> 1000 where it becomes wrong, it should be 999 -> 0000. Of course, I'll need to clear the contents of the cache where the lengths aren't what I want.
// patternLength = 4
string[] result = new string[10000];
string[] cache = new string[10000];
result.Append("");
dp[0] = "0";
int i = 1;
int j = 0;
foreach (string entry in result)
{
foreach (char ch in PossibleCharacters)
{
cache[i] = entry + ch;
i++;
}
result[j + 1] = j.ToString();
j++;
}
return cache.toList();
Thanks all for your time.
Your code is inherently bounded by an O(10^n) minimum time complexity (where n is the length of the tested string) and there is no way to bypass this limit without changing its functionality. In other words, your code is most likely slow because you are doing a lot of work, not because you implemented that work in a particularly inefficient way. Optimizing string generation is very unlikely to provide any significant improvement in run time. Utilizing multithreading is more likely to provide real-world performance improvements in your described scenario, so personally I would start from there. Additionally, if you only need to access each tested string once, it is preferable to generate them one by one as they are used, in order to reduce your program's space complexity. Currently you are storing every string for the entire duration of your program or function's execution, which is a potentially wasteful use of memory if they are only needed once.
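As a minimal sketch of the one-by-one generation mentioned above, assuming the alphabet stays the ten digits so that each pattern is just a zero-padded counter: PatternSequence and Generate are hypothetical names, not part of the question's code.

```csharp
using System;
using System.Collections.Generic;

static class PatternSequence
{
    // Yields "000", "001", ... "999" for patternLength 3, one string at a
    // time, so the caller never holds the whole list in memory.
    public static IEnumerable<string> Generate(int patternLength)
    {
        // 10^18 is the largest power of ten that still fits in a long.
        if (patternLength < 0 || patternLength > 18)
            throw new ArgumentOutOfRangeException(nameof(patternLength));

        long count = 1;
        for (int d = 0; d < patternLength; d++) count *= 10;

        for (long value = 0; value < count; value++)
        {
            // PadLeft restores the leading zeros that ToString() drops.
            // Length 0 yields a single empty string, like the recursive version.
            yield return patternLength == 0
                ? ""
                : value.ToString().PadLeft(patternLength, '0');
        }
    }
}
```

A caller would consume it with foreach (string candidate in PatternSequence.Generate(4)) { ... }. The total work is still proportional to 10^n; only the memory use changes.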

How can I implement the Viterbi algorithm in C# to split conjoined words?

In short - I want to convert the first answer to the question here from Python into C#. My current solution to splitting conjoined words is exponential, and I would like a linear solution. I am assuming no spacing and consistent casing in my input text.
Background
I wish to convert conjoined strings such as "wickedweather" into separate words, for example "wicked weather", using C#. I have created a working solution, a recursive function that takes exponential time, which is simply not efficient enough for my purposes (processing more than 100 joined words). Here are the questions I have read so far, which I believe may be helpful, but I cannot translate their answers from Python to C#.
How can I split multiple joined words?
Need help understanding this Python Viterbi algorithm
How to extract literal words from a consecutive string efficiently?
My Current Recursive Solution
This is for people who only want to split a few words (< 50) in C# and don't really care about efficiency.
My current solution works out all possible combinations of words, finds the most probable output and displays. I am currently defining the most probable output as the one which uses the longest individual words - I would prefer to use a different method. Here is my current solution, using a recursive algorithm.
static public string find_words(string instring)
{
if (words.Contains(instring)) //where words is my dictionary of words
{
return instring;
}
if (solutions.ContainsKey(instring.ToString()))
{
return solutions[instring];
}
string bestSolution = "";
string solution = "";
for (int i = 1; i < instring.Length; i++)
{
string partOne = find_words(instring.Substring(0, i));
string partTwo = find_words(instring.Substring(i, instring.Length - i));
if (partOne == "" || partTwo == "")
{
continue;
}
solution = partOne + " " + partTwo;
//if my current solution is smaller than my best solution so far (smaller solution means I have used the space to separate words fewer times, meaning the words are larger)
if (bestSolution == "" || solution.Length < bestSolution.Length)
{
bestSolution = solution;
}
}
solutions[instring] = bestSolution;
return bestSolution;
}
This algorithm relies on having no spacing or other symbols in the entry text (not really a problem here, I'm not fussed about splitting up punctuation). Random additional letters added within the string can cause an error, unless I store each letter of the alphabet as a "word" within my dictionary. This means that "wickedweatherdykjs" would return "wicked weather d y k j s" using the above algorithm, when I would prefer an output of "wicked weather dykjs".
My updated exponential solution:
static List<string> words = File.ReadLines("E:\\words.txt").ToList();
static Dictionary<char, HashSet<string>> compiledWords = buildDictionary(words);
private void btnAutoSpacing_Click(object sender, EventArgs e)
{
string text = txtText.Text;
text = RemoveSpacingandNewLines(text); //get rid of anything that breaks the algorithm
if (text.Length > 150)
{
//possibly split the text up into more manageable chunks?
//considering using textSplit() for this.
}
else
{
txtText.Text = find_words(text);
}
}
static IEnumerable<string> textSplit(string str, int chunkSize)
{
return Enumerable.Range(0, str.Length / chunkSize)
.Select(i => str.Substring(i * chunkSize, chunkSize));
}
private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
var dictionary = new Dictionary<char, HashSet<string>>();
foreach (var word in words)
{
var key = word[0];
if (!dictionary.ContainsKey(key))
{
dictionary[key] = new HashSet<string>();
}
dictionary[key].Add(word);
}
return dictionary;
}
static public string find_words(string instring)
{
string bestSolution = "";
string solution = "";
if (compiledWords[instring[0]].Contains(instring))
{
return instring;
}
if (solutions.ContainsKey(instring.ToString()))
{
return solutions[instring];
}
for (int i = 1; i < instring.Length; i++)
{
string partOne = find_words(instring.Substring(0, i));
string partTwo = find_words(instring.Substring(i, instring.Length - i));
if (partOne == "" || partTwo == "")
{
continue;
}
solution = partOne + " " + partTwo;
if (bestSolution == "" || solution.Length < bestSolution.Length)
{
bestSolution = solution;
}
}
solutions[instring] = bestSolution;
return bestSolution;
}
How I would like to use the Viterbi Algorithm
I would like to create an algorithm which works out the most probable solution to a conjoined string, where the probability is calculated according to the position of the word in a text file that I provide the algorithm with. Let's say the file starts with the most common word in the English language first, and on the next line the second most common, and so on until the least common word in my dictionary. It looks roughly like this
the
be
and
...
attorney
Here is a link to a small example of such a text file I would like to use.
Here is a much larger text file which I would like to use
The logic behind this file positioning is as follows...
It is reasonable to assume that they follow Zipf's law, that is the
word with rank n in the list of words has probability roughly 1/(n log
N) where N is the number of words in the dictionary.
Generic Human, in his excellent Python solution, explains this much better than I can. I would like to convert his solution to the problem from Python into C#, but after many hours spent attempting this I haven't been able to produce a working solution.
I also remain open to the idea that perhaps relative frequencies with the Viterbi algorithm isn't the best way to split words, any other suggestions for creating a solution using C#?
Written text is highly contextual and you may wish to use a Markov chain to model sentence structure in order to estimate joint probability. Unfortunately, sentence structure breaks the Viterbi assumption -- but there is still hope, the Viterbi algorithm is a case of branch-and-bound optimization aka "pruned dynamic programming" (something I showed in my thesis) and therefore even when the cost-splicing assumption isn't met, you can still develop cost bounds and prune your population of candidate solutions. But let's set Markov chains aside for now... assuming that the probabilities are independent and each follows Zipf's law, what you need to know is that the Viterbi algorithm works on accumulating additive costs.
For independent events, joint probability is the product of the individual probabilities, making negative log-probability a good choice for the cost.
So your single-step cost would be -log(P) or log(1/P) which is log(index * log(N)) which is log(index) + log(log(N)) and the latter term is a constant.
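The cost above can be plugged into a small dynamic program over string positions. The following is only a sketch under the independence assumption just stated, not a port that has been checked against the Python answer: WordSplitter, Split and UnknownCharCost are hypothetical names, the penalty value is arbitrary, and unknown text is emitted one character at a time rather than grouped.

```csharp
using System;
using System.Collections.Generic;

class WordSplitter
{
    // Assumed penalty for a character that no dictionary word can cover.
    private const double UnknownCharCost = 1000.0;

    private readonly Dictionary<string, double> wordCost = new Dictionary<string, double>();
    private readonly int maxWordLength;

    // rankedWords: most frequent word first, as in the word file described above.
    public WordSplitter(IReadOnlyList<string> rankedWords)
    {
        // Guard so that log(N) stays above 1 and costs stay positive for tiny lists.
        double logN = Math.Log(Math.Max(rankedWords.Count, 3));
        for (int i = 0; i < rankedWords.Count; i++)
        {
            string word = rankedWords[i];
            if (word.Length == 0 || wordCost.ContainsKey(word)) continue;
            wordCost[word] = Math.Log((i + 1) * logN);   // -log P = log(rank * log N)
            maxWordLength = Math.Max(maxWordLength, word.Length);
        }
    }

    public string Split(string text)
    {
        int n = text.Length;
        var best = new double[n + 1];   // best[i]: lowest cost of splitting text[0..i)
        var lastLen = new int[n + 1];   // lastLen[i]: length of the final token in that split

        for (int i = 1; i <= n; i++)
        {
            // Fallback: treat text[i - 1] as an unknown single-character token.
            best[i] = best[i - 1] + UnknownCharCost;
            lastLen[i] = 1;

            // Try every dictionary word that ends at position i.
            for (int len = 1; len <= Math.Min(i, maxWordLength); len++)
            {
                if (!wordCost.TryGetValue(text.Substring(i - len, len), out double cost)) continue;
                double candidate = best[i - len] + cost;
                if (candidate < best[i])
                {
                    best[i] = candidate;
                    lastLen[i] = len;
                }
            }
        }

        // Walk back through the recorded token lengths to recover the words.
        var tokens = new List<string>();
        for (int i = n; i > 0; i -= lastLen[i])
        {
            tokens.Add(text.Substring(i - lastLen[i], lastLen[i]));
        }
        tokens.Reverse();
        return string.Join(" ", tokens);
    }
}
```

Each position looks back at most maxWordLength characters, so the running time grows linearly with the input length for a fixed dictionary. The constant log(log(N)) term is kept in the cost here; because it is added once per word, it slightly favours splits with fewer words.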
Can't help you with the Viterbi algorithm, but I'll give my two cents concerning your current approach. From your code it's not exactly clear what words is. This can be a real bottleneck if you don't choose a good data structure. As a gut feeling I'd initially go with a Dictionary<char, HashSet<string>> where the key is the first letter of each word:
private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
var dictionary = new Dictionary<char, HashSet<string>>();
foreach (var word in words)
{
var key = word[0];
if (!dictionary.ContainsKey(key))
{
dictionary[key] = new HashSet<string>();
}
dictionary[key].Add(word);
}
return dictionary;
}
And I'd also consider serializing it to disk to avoid building it up every time.
Not sure how much improvement you can make like this (I don't have full information about your current implementation), but benchmark it and see if you get any improvement.
NOTE: I'm assuming all words are cased consistently.

What is the fastest way to split a ':'-separated string into a given number of chunks where the record length is variable

I have a large string received from a TCP listener, in the following format:
"1,7620257787,0123456789,99,0922337203,9223372036,32.5455,87,12.7857,1/1/2012,9223372036:1,7620257787,0123456789,99,0922337203,9223372036,32.5455,87,12.7857:2/1/2012,234234234:3,7620257787,01234343456789,99,0922337203,9223372036,32.5455,87,12.7857,1/1/2012,9223372036:34,76202343457787,012434343456789,93339,34340922337203,9223372036,32.5455,87,12.7857,1/1/2012,9223372036"
You can see that this is a ':'-separated string containing records, each made up of comma-separated fields.
I am looking for the best (fastest) way to split the string into a given number of chunks, making sure that each chunk contains only complete records (strings up to ':').
In other words, there should not be any chunk that does not end with ':'.
E.g. a 20 MB string split into 4 chunks of about 5 MB each with complete records (so the size of each chunk may not be exactly 5 MB but very near to it, and the total of all 4 chunks will be 20 MB).
I hope you can understand my question (sorry for the bad English).
I like the following link, but it does not take care of keeping records whole while splitting, and I also don't know if it is the best and fastest way.
Split String into smaller Strings by length variable
I don't know how large a 'large string' is, but initially I would just try it with the String.Split method.
The idea is to divide the length of your data by the number of blocks required, then look backwards for the last sep in the current block.
private string[] splitToBlocks(string data, int numBlocks, char sep)
{
    // Nothing to split: return the data as a single block
    if (numBlocks <= 1 || data.Length == 0)
    {
        return new string[] { data };
    }
    string[] result = new string[numBlocks];
    // The optimal size of each block
    int blockLen = data.Length / numBlocks;
    int pos = 0;
    for (int idx = 0; idx < numBlocks; idx++)
    {
        if (idx == numBlocks - 1 || pos >= data.Length)
        {
            // The last block takes whatever is left (possibly nothing)
            result[idx] = data.Substring(pos);
            pos = data.Length;
            continue;
        }
        // Search backwards for the last sep at or before the ideal block end
        int target = Math.Min(Math.Max(pos, blockLen * (idx + 1)), data.Length - 1);
        int lastSepPos = data.LastIndexOf(sep, target, target - pos + 1);
        if (lastSepPos < 0)
        {
            // No sep inside this block: extend forward to the next one
            lastSepPos = data.IndexOf(sep, target);
            if (lastSepPos < 0) lastSepPos = data.Length - 1;
        }
        // Get the block data in the result array
        result[idx] = data.Substring(pos, lastSepPos + 1 - pos);
        // Reposition for the next block
        pos = lastSepPos + 1;
    }
    return result;
}
Please test it. I have not fully tested for fringe cases.
OK, I suggest a two-step approach:
Split the string into chunks (see below)
Check the chunks for completeness
Splitting the string into chunks with the help of LINQ (the LINQ extension method is taken from Split a collection into `n` parts with LINQ?):
string tcpstring = "chunk1 : chunck2 : chunk3: chunk4 : chunck5 : chunk6";
int numOfChunks = 4;
var chunks = (from string z in (tcpstring.Split(':').AsEnumerable()) select z).Split(numOfChunks);
List<string> result = new List<string>();
foreach (IEnumerable<string> chunk in chunks)
{
result.Add(string.Join(":",chunk));
}
.......
static class LinqExtensions
{
public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> list, int parts)
{
int i = 0;
var splits = from item in list
group item by i++ % parts into part
select part.AsEnumerable();
return splits;
}
}
Did I understand your aim correctly?
[EDIT]
In my opinion, if performance is a consideration, it is better to use the String.Split method for chunking.
It seems you want to split on ":" (you can use the Split method).
Then you have to add ":" back to each chunk after splitting.
(You can then split on "," for each of the strings that were split by ":".)
int index = yourstring.IndexOf(":");
string whatever = yourstring.Substring(0, index + 1); // the first record, including its ":"
yourstring = yourstring.Substring(index + 1); // the rest of the string, without the part you just cut out
This is only a general example; all you need to do is run it in a loop for as long as a ":" character is found. Cheers...

Text Justification

I am looking for a c# function or routine that will center justify text.
For example, when a sentence is justified to the edges of the screen, I have noticed that extra spaces are placed in the line. The inserted spaces start in the center and move out from there on both sides as needed.
Is there a C# function that I can pass my string, say 50 chars, and get back a pretty 56 char string?
Thanks in advance,
Rob
Nice task. Here's a solution based on Linq extension methods. If you do not wish to use them, see history for previous version of code. In this example spaces on the left and right sides from center are 'equal' with respect to order of inserting.
using System;
using System.Collections.Generic;
using System.Linq;
class Program
{
public static String Justify(String s, Int32 count)
{
if (count <= 0)
return s;
Int32 middle = s.Length / 2;
IDictionary<Int32, Int32> spaceOffsetsToParts = new Dictionary<Int32, Int32>();
String[] parts = s.Split(' ');
for (Int32 partIndex = 0, offset = 0; partIndex < parts.Length; partIndex++)
{
spaceOffsetsToParts.Add(offset, partIndex);
offset += parts[partIndex].Length + 1; // +1 to count space that was removed by Split
}
foreach (var pair in spaceOffsetsToParts.OrderBy(entry => Math.Abs(middle - entry.Key)))
{
count--;
if (count < 0)
break;
parts[pair.Value] += ' ';
}
return String.Join(" ", parts);
}
static void Main(String[] args) {
String s = "skvb sdkvkd s kc wdkck sdkd sdkje sksdjs skd";
String j = Justify(s, 5);
Console.WriteLine("Initial: " + s);
Console.WriteLine("Result: " + j);
Console.ReadKey();
}
}
As far as I know, there is no "built-in" C# or .net library function for this, so you'd have to implement something on your own (or find some code online whose license suits your needs).
A simple greedy algorithm shouldn't be too difficult to implement, however:
Until the required number of characters is reached:
Extend the shortest sequence of spaces by one
(choose one randomly if there is more than one such sequence).
I'd distribute the spaces randomly rather than starting at the center, to make sure the spaces are evenly distributed among your text (rather than concentrated at one position). Oh, and keep in mind that some people consider fully-justified fixed-font text harder to read than left-justified fixed-font text.
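A minimal sketch of that greedy idea, under a few assumptions: GreedyJustifier and Justify are hypothetical names, existing runs of spaces are first collapsed to a single space, and the caller supplies the Random instance. If the line is already at least targetWidth characters long, no padding is added.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class GreedyJustifier
{
    // Widens the gaps between words until the line is targetWidth characters long.
    public static string Justify(string line, int targetWidth, Random random)
    {
        string[] words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length < 2) return line;   // no gap that could be widened

        var gaps = new int[words.Length - 1];
        int length = gaps.Length;            // one space per gap to start with
        for (int g = 0; g < gaps.Length; g++) gaps[g] = 1;
        foreach (string word in words) length += word.Length;

        var shortest = new List<int>();
        while (length < targetWidth)
        {
            // Find the currently shortest gaps...
            int min = int.MaxValue;
            foreach (int gap in gaps) min = Math.Min(min, gap);
            shortest.Clear();
            for (int g = 0; g < gaps.Length; g++)
            {
                if (gaps[g] == min) shortest.Add(g);
            }
            // ...and extend one of them, picked at random, by one space.
            gaps[shortest[random.Next(shortest.Count)]]++;
            length++;
        }

        var result = new StringBuilder(length);
        for (int w = 0; w < words.Length; w++)
        {
            result.Append(words[w]);
            if (w < gaps.Length) result.Append(' ', gaps[w]);
        }
        return result.ToString();
    }
}
```

Because the shortest gap is always extended first, no two gaps ever differ by more than one space, which keeps the padding spread over the whole line.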

Does anyone know of a faster method to do String.Split()?

I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:
values = line.Split(delimiter);
where line is a string that holds the values that are separated by the delimiter.
Measuring the performance of my ReadNextRow method I noticed that it spends 66% on String.Split, so I was wondering if someone knows of a faster method to do this.
Thanks!
The BCL implementation of string.Split is actually quite fast. I've done some testing here trying to outperform it, and it's not easy.
But there's one thing you can do and that's to implement this as a generator:
public static IEnumerable<string> GetSplit( this string s, char c )
{
    int l = s.Length;
    int i = 0, j = s.IndexOf( c, 0, l );
    if ( j == -1 ) // No such substring
    {
        yield return s; // Return original and break
        yield break;
    }
    while ( j != -1 )
    {
        if ( j - i > 0 ) // Non empty?
        {
            yield return s.Substring( i, j - i ); // Return non-empty match
        }
        i = j + 1;
        j = s.IndexOf( c, i, l - i );
    }
    if ( i < l ) // Has remainder?
    {
        yield return s.Substring( i, l - i ); // Return remaining trail
    }
}
The above method is not necessarily faster than string.Split for small strings but it returns results as it finds them, this is the power of lazy evaluation. If you have long lines or need to conserve memory, this is the way to go.
The above method is bounded by the performance of IndexOf and Substring, which do a lot of index-out-of-range checking; to be faster you need to optimize these away and implement your own helper methods. You can beat the string.Split performance, but it's going to take clever int-hacking. You can read my post about that here.
It should be pointed out that split() is a questionable approach for parsing CSV files in case you come across commas in the file eg:
1,"Something, with a comma",2,3
The other thing I'll point out without knowing how you profiled is be careful about profiling this kind of low level detail. The granularity of the Windows/PC timer might come into play and you may have a significant overhead in just looping so use some sort of control value.
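That being said, Java's split() is built to handle regular expressions, which are obviously more complex than you need (and the wrong tool to deal with escaped commas anyway). .NET's String.Split() takes plain character or string delimiters rather than a regular expression, but like the Java method it creates lots of temporary objects.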
So if you want to speed it up (and I have trouble believing that performance of this part is really an issue) then you want to do it by hand and you want to reuse your buffer objects so you're not constantly creating objects and giving the garbage collector work to do in cleaning them up.
The algorithm for that is relatively simple:
Stop at every comma;
When you hit quotes continue until you hit the next set of quotes;
Handle escaped quotes (ie \") and arguably escaped commas (\,).
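The steps above can be sketched roughly as follows. This is an illustration rather than a complete CSV parser: CsvLineSplitter and SplitLine are hypothetical names, it handles a single line only, and it assumes the Excel-style convention in which a quote inside a quoted field is written as two quotes (""). The backslash escapes mentioned in the list are not handled.

```csharp
using System.Collections.Generic;
using System.Text;

static class CsvLineSplitter
{
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var field = new StringBuilder();   // reused for every field of the line
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    field.Append(c);       // commas inside quotes are plain text
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    field.Append('"');     // doubled quote = one literal quote
                    i++;
                }
                else
                {
                    inQuotes = false;      // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true;           // opening quote
            }
            else if (c == ',')
            {
                fields.Add(field.ToString());   // a comma outside quotes ends the field
                field.Clear();
            }
            else
            {
                field.Append(c);
            }
        }
        fields.Add(field.ToString());      // the last field has no trailing comma
        return fields;
    }
}
```

Each field still costs one string allocation, so further speed-ups would have to come from parsing values straight out of the buffer instead of materializing every field.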
Oh and to give you some idea of the cost of regex, there was a question (Java not C# but the principle was the same) where someone wanted to replace every n-th character with a string. I suggested using replaceAll() on String. Jon Skeet manually coded the loop. Out of curiosity I compared the two versions and his was an order of magnitude better.
So if you really want performance, it's time to hand parse.
Or, better yet, use someone else's optimized solution like this fast CSV reader.
By the way, while this is in relation to Java it concerns the performance of regular expressions in general (which is universal) and replaceAll() vs a hand-coded loop: Putting char into a java string for each N characters.
Here's a very basic example using ReadOnlySpan. On my machine this takes around 150ns as opposed to string.Split() which takes around 250ns. That's a nice 40% improvement right there.
string serialized = "1577836800;1000;1";
ReadOnlySpan<char> span = serialized.AsSpan();
Trade result = new Trade(); // Trade is the poster's own type (not shown)
int index = span.IndexOf(';');
result.UnixTimestamp = long.Parse(span.Slice(0, index));
span = span.Slice(index + 1);
index = span.IndexOf(';');
result.Price = float.Parse(span.Slice(0, index));
span = span.Slice(index + 1);
// The last field has no trailing ';', so parse whatever is left
result.Quantity = float.Parse(span);
return result;
Note that a ReadOnlySpan.Split() will soon be part of the framework. See
https://github.com/dotnet/runtime/pull/295
Note that this applies to Java rather than C#. Depending on use, you can speed this up by using Pattern.split instead of String.split. If you have this code in a loop (which I assume you probably do, since it sounds like you are parsing lines from a file), String.split(String regex) will call Pattern.compile on your regex string every time that statement of the loop executes. To optimize this, Pattern.compile the pattern once outside the loop and then use Pattern.split, passing the line you want to split, inside the loop.
Hope this helps
I found this implementation, which is 30% faster, on Dejan Pelzel's blog. I quote from there:
The Solution
With this in mind, I set to create a string splitter that would use an internal buffer similarly to a StringBuilder. It uses very simple logic of going through the string and saving the value parts into the buffer as it goes along.
public int Split(string value, char separator)
{
    int resultIndex = 0;
    int startIndex = 0;
    // Find the mid-parts
    for (int i = 0; i < value.Length; i++)
    {
        if (value[i] == separator)
        {
            this.buffer[resultIndex] = value.Substring(startIndex, i - startIndex);
            resultIndex++;
            startIndex = i + 1;
        }
    }
    // Find the last part
    this.buffer[resultIndex] = value.Substring(startIndex, value.Length - startIndex);
    resultIndex++;
    return resultIndex;
}
How To Use
The StringSplitter class is incredibly simple to use, as you can see in the example below. Just be careful to reuse the StringSplitter object and not create a new instance of it in loops or for one-off use. In that case it would be better to just use the built-in String.Split.
var splitter = new StringSplitter(2);
splitter.Split("Hello World", ' ');
if (splitter.Results[0] == "Hello" && splitter.Results[1] == "World")
{
Console.WriteLine("It works!");
}
The Split method returns the number of items found, so you can easily iterate through the results like this:
var splitter = new StringSplitter(2);
var len = splitter.Split("Hello World", ' ');
for (int i = 0; i < len; i++)
{
Console.WriteLine(splitter.Results[i]);
}
This approach has advantages and disadvantages.
You might think that there are optimizations to be had, but the reality will be you'll pay for them elsewhere.
You could, for example, do the split 'yourself' and walk through all the characters and process each column as you encounter it, but you'd be copying all the parts of the string in the long run anyhow.
One of the optimizations we could do in C or C++, for example, is replace all the delimiters with '\0' characters, and keep pointers to the start of the column. Then, we wouldn't have to copy all of the string data just to get to a part of it. But this you can't do in C#, nor would you want to.
If there is a big difference between the number of columns that are in the source, and the number of columns that you need, walking the string manually may yield some benefit. But that benefit would cost you the time to develop it and maintain it.
I've been told that 90% of the CPU time is spent in 10% of the code. There are variations to this "truth". In my opinion, spending 66% of your time in Split is not that bad if processing CSV is the thing that your app needs to do.
Dave
Some very thorough analysis on String.Split() vs Regex and other methods.
We are talking ms savings over very large strings though.
The main problem(?) with String.Split is that it's general, in that it caters for many needs.
If you know more about your data than Split would, it can make an improvement to make your own.
For instance, if:
You don't care about empty strings, so you don't need to handle those any special way
You don't need to trim strings, so you don't need to do anything with or around those
You don't need to check for quoted commas or quotes
You don't need to handle quotes at all
If any of these are true, you might see an improvement by writing your own more specific version of String.Split.
Having said that, the first question you should ask is whether this actually is a problem worth solving. Is the time taken to read and import the file so long that you actually feel this is a good use of your time? If not, then I would leave it alone.
The second question is why String.Split is using that much time compared to the rest of your code. If the answer is that the code is doing very little with the data, then I would probably not bother.
However, if, say, you're stuffing the data into a database, then 66% of your code's time spent in String.Split constitutes a big, big problem, since the database work would normally be expected to dominate.
CSV parsing is actually fiendishly complex to get right. The one and only time I had to do this, I used classes based on wrapping the ODBC Text driver.
The ODBC solution recommended above looks at first glance to be basically the same approach.
I thoroughly recommend you do some research on CSV parsing before you get too far down a path that nearly-but-not-quite works (all too common). The Excel thing of only double-quoting strings that need it is one of the trickiest to deal with in my experience.
As others have said, String.Split() will not always work well with CSV files. Consider a file that looks like this:
"First Name","Last Name","Address","Town","Postcode"
David,O'Leary,"12 Acacia Avenue",London,NW5 3DF
June,Robinson,"14, Abbey Court","Putney",SW6 4FG
Greg,Hampton,"",,
Stephen,James,"""Dunroamin"" 45 Bridge Street",Bristol,BS2 6TG
(e.g. inconsistent use of speechmarks, strings including commas and speechmarks, etc)
This CSV reading framework will deal with all of that, and is also very efficient:
LumenWorks.Framework.IO.Csv by Sebastien Lorion
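To give a sense of what a reader has to do beyond a plain split for the sample lines above, here is a minimal quote-aware sketch in Java. The names CsvLineSketch and parseLine are invented for this example. It only handles one line at a time, with double-quoted fields and doubled quotes as escapes. It does not handle quoted fields that span several lines, other delimiters, or malformed input, so treat it as an illustration and not a substitute for a real CSV library.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvLineSketch {
    /** Splits one CSV line into fields, honoring double-quoted fields and "" escapes. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        field.append('"'); // a doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    field.append(c); // commas inside quotes are data
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == ',') {
                fields.add(field.toString()); // unquoted comma ends the field
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString()); // the last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String line = "June,Robinson,\"14, Abbey Court\",\"Putney\",SW6 4FG";
        System.out.println(line.split(",").length); // prints 6
        System.out.println(parseLine(line).size()); // prints 5
    }
}
```

The plain split cuts "14, Abbey Court" in two and leaves the quote characters in place, which is exactly the nearly-but-not-quite behaviour that makes hand-rolled CSV parsing risky.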
This is my solution:
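Public Shared Function FastSplit(inputString As String, separator As String) As String()
    ' Assumes a single-character separator.
    Dim kwds(0) As String
    Dim k = 0
    Dim tmp As String = ""
    ' Mid is 1-based, so run to Length to include the last character.
    For l = 1 To inputString.Length
        tmp = Mid(inputString, l, 1)
        ' On a separator, start a new element and grow the array by exactly one slot.
        If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k)
        kwds(k) &= tmp
    Next
    Return kwds
End Function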
Here is a version with benchmarking:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
    Dim sw As New Stopwatch
    sw.Start()
    Dim kwds(0) As String
    Dim k = 0
    Dim tmp As String = ""
    ' Mid is 1-based, so run to Length to include the last character.
    For l = 1 To inputString.Length
        tmp = Mid(inputString, l, 1)
        If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k)
        kwds(k) &= tmp
    Next
    sw.Stop()
    Dim fsTime As Long = sw.ElapsedTicks
    ' Restart resets the elapsed time, so the second figure covers Split alone.
    sw.Restart()
    Dim strings() As String = inputString.Split(separator)
    sw.Stop()
    Debug.Print("FastSplit took " + fsTime.ToString + " whereas split took " + sw.ElapsedTicks.ToString)
    Return kwds
End Function
Here are some results on relatively small strings of varying sizes, up to 8 KB blocks (times are in ticks). They were recorded with the stopwatch started a second time without being reset, so each "split" figure is cumulative and includes the FastSplit time.
FastSplit took 8 whereas split took 10
FastSplit took 214 whereas split took 216
FastSplit took 10 whereas split took 12
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 12
FastSplit took 7 whereas split took 9
FastSplit took 6 whereas split took 8
FastSplit took 5 whereas split took 7
FastSplit took 10 whereas split took 13
FastSplit took 9 whereas split took 232
FastSplit took 7 whereas split took 8
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 215 whereas split took 217
FastSplit took 10 whereas split took 231
FastSplit took 8 whereas split took 10
FastSplit took 8 whereas split took 10
FastSplit took 7 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 1405
FastSplit took 9 whereas split took 11
FastSplit took 8 whereas split took 10
Also, I know someone will discourage my use of ReDim Preserve instead of using a list... The reason is, the list really didn't provide any speed difference in my benchmarks so I went back to the "simple" way.
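// Requires: using System.Collections.Generic; and compiling with /unsafe.
public static unsafe List<string> SplitString(char separator, string input)
{
    List<string> result = new List<string>();
    int start = 0; // index where the current segment begins
    fixed (char* buffer = input)
    {
        for (int j = 0; j < input.Length; j++)
        {
            if (buffer[j] == separator)
            {
                // Copy the segment out; never write through the pointer,
                // because that would corrupt the (immutable) input string.
                result.Add(new string(buffer, start, j - start));
                start = j + 1;
            }
        }
        // The last segment has no trailing separator.
        result.Add(new string(buffer, start, input.Length - start));
    }
    return result;
}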
You can assume that String.Split will be close to optimal; i.e. it could be quite hard to improve on it. By far the easier solution is to check whether you need to split the string at all. It's quite likely that you'll be using the individual strings directly. If you define a StringShim class (reference to String, begin & end index) you'll be able to split a String into a set of shims instead. These will have a small, fixed size, and will not cause string data copies.
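A minimal sketch of the shim idea, written in Java like the fSplit helpers further down. StringShim is the hypothetical class named above; its fields and methods here are my own guess at a reasonable shape, not an existing API. In current .NET, ReadOnlyMemory<char> can play a similar role.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical "shim": a view onto part of a string, holding a reference and two indexes. */
final class StringShim {
    final String source;
    final int begin; // inclusive
    final int end;   // exclusive

    StringShim(String source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    int length() {
        return end - begin;
    }

    /** Reads a character of the view directly from the source string (no bounds check on the view). */
    char charAt(int i) {
        return source.charAt(begin + i);
    }

    /** Copies the characters out; call this only when a real string is needed. */
    @Override
    public String toString() {
        return source.substring(begin, end);
    }

    /** Splits {@code source} on {@code delim} without copying any character data. */
    static List<StringShim> split(String source, char delim) {
        List<StringShim> shims = new ArrayList<>();
        int start = 0;
        int index;
        while ((index = source.indexOf(delim, start)) != -1) {
            shims.add(new StringShim(source, start, index));
            start = index + 1;
        }
        shims.add(new StringShim(source, start, source.length()));
        return shims;
    }

    public static void main(String[] args) {
        for (StringShim shim : split("a,bc,,d", ',')) {
            System.out.println(shim.begin + ".." + shim.end + " -> \"" + shim + "\"");
        }
    }
}
```

Each shim costs one small object regardless of the field's length, and the character data is only copied if toString is called. The trade-off is that every shim keeps the whole source string alive.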
String.split is rather slow; if you want some faster methods, here you go. :)
However, CSV is much better parsed by a rule-based parser.
This guy has made a rule-based tokenizer for Java (requires some copy and pasting, unfortunately).
http://www.csdgn.org/code/rule-tokenizer
// Splits on a single character.
private static final String[] fSplit(String src, char delim) {
    ArrayList<String> output = new ArrayList<String>();
    int index = 0;
    int lindex = 0; // start of the current segment
    while ((index = src.indexOf(delim, lindex)) != -1) {
        output.add(src.substring(lindex, index));
        lindex = index + 1;
    }
    output.add(src.substring(lindex)); // remainder after the last delimiter
    return output.toArray(new String[output.size()]);
}

// Splits on a literal string delimiter (not a regex).
private static final String[] fSplit(String src, String delim) {
    ArrayList<String> output = new ArrayList<String>();
    int index = 0;
    int lindex = 0; // start of the current segment
    while ((index = src.indexOf(delim, lindex)) != -1) {
        output.add(src.substring(lindex, index));
        lindex = index + delim.length();
    }
    output.add(src.substring(lindex)); // remainder after the last delimiter
    return output.toArray(new String[output.size()]);
}
