Efficiently split a string in format "{ {}, {}, ...}"

I have a string in the following format.
string instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}"
private void parsestring(string input)
{
string[] tokens = input.Split(','); // I thought this would split on the , separating the {}
foreach (string item in tokens) // but that doesn't seem to be what it is doing
{
Console.WriteLine(item);
}
}
My desired output should be something like this below:
112,This is the first day 23/12/2009
132,This is the second day 24/12/2009
But currently, I get the one below:
{112
This is the first day 23/12/2009
{132
This is the second day 24/12/2009
I am very new to C# and any help would be appreciated.

Don't fixate on Split() being the solution! This is a simple thing to parse without it. Regex answers are probably also OK, but I imagine in terms of raw efficiency making "a parser" would do the trick.
IEnumerable<string> Parse(string input)
{
var results = new List<string>();
int startIndex = 0;
int currentIndex = 0;
while (currentIndex < input.Length)
{
var currentChar = input[currentIndex];
if (currentChar == '{')
{
startIndex = currentIndex + 1;
}
else if (currentChar == '}')
{
int endIndex = currentIndex - 1;
int length = endIndex - startIndex + 1;
results.Add(input.Substring(startIndex, length));
}
currentIndex++;
}
return results;
}
So it's not short on lines. It iterates once, and only performs one allocation per "result". With a little tweaking I could probably make a C#8 version with Index types that cuts on allocations? This is probably good enough.
You could spend a whole day figuring out how to understand the regex, but this is as simple as it comes:
Scan every character.
If you find {, note the next character is the start of a result.
If you find }, consider everything from the last noted "start" until the index before this character as "a result".
This won't catch mismatched brackets, and it returns nonsense results for strings like "}}{". You didn't ask for handling those cases, but it's not too hard to improve this logic to catch them and scream about it or recover.
For example, you could reset startIndex to something like -1 when } is found. From there, you can deduce that if you find { when startIndex != -1, you've found "{{". And if you find } when startIndex == -1, you've found "}}". And if you exit the loop with startIndex >= 0, that's an opening { with no closing }. That leaves the string "}whoops" as an uncovered case, but it could be handled by initializing startIndex to, say, -2 and checking for that specifically. Do that with a regex, and you'll have a headache.
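A minimal sketch of that stricter variant, assuming you want malformed input to throw rather than be recovered from. The names BraceParser, ParseStrict, NotStarted and Closed are illustrative and not part of the answer above; the checks compare against >= 0 instead of != -1 so that the -2 starting value works too.

```csharp
using System;
using System.Collections.Generic;

public static class BraceParser
{
    // Hypothetical helper: same single pass as Parse, plus sentinel checks.
    public static List<string> ParseStrict(string input)
    {
        const int NotStarted = -2; // no brace seen yet
        const int Closed = -1;     // the last group was closed properly
        var results = new List<string>();
        int startIndex = NotStarted;

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '{')
            {
                // A group is already open, so this is "{{".
                if (startIndex >= 0)
                    throw new FormatException("Nested '{' at index " + i);
                startIndex = i + 1;
            }
            else if (c == '}')
            {
                // No group is open: either "}}" or a leading '}'.
                if (startIndex < 0)
                    throw new FormatException("Unmatched '}' at index " + i);
                results.Add(input.Substring(startIndex, i - startIndex));
                startIndex = Closed;
            }
        }

        // Leaving the loop with a group still open means a missing '}'.
        if (startIndex >= 0)
            throw new FormatException("Unclosed '{' at end of input");
        return results;
    }
}
```

Calling BraceParser.ParseStrict on the string from the question yields the two inner strings, while inputs such as "}whoops" or "{a" raise a FormatException.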
The main reason I suggest this is you said "efficiently". icepickle's solution is nice, but Split() makes one allocation per token, then you perform allocations for each TrimX() call. That's not "efficient". That's "n + 2 allocations".

Use Regex for this:
string[] tokens = Regex.Split(input, @"}\s*,\s*{")
.Select(i => i.Replace("{", "").Replace("}", ""))
.ToArray();
Pattern explanation:
\s* - match zero or more white space characters

Well, if you have a method that is called ParseString, it's a good thing it returns something (and it might not be that bad to say that it is ParseTokens instead). So if you do that, you can come to the following code
private static IEnumerable<string> ParseTokens(string input)
{
return input
// removes the leading {
.TrimStart('{')
// removes the trailing }
.TrimEnd('}')
// splits on the different token in the middle
.Split( new string[] { "},{" }, StringSplitOptions.None );
}
The reason it didn't work for you before is that Split(',') splits on every , in the input, not only on the commas between the {} groups.
Now if you put this all together, you get something like in this dotnetfiddle
using System;
using System.Collections.Generic;
public class Program
{
private static IEnumerable<string> ParseTokens(string input)
{
return input
// removes the leading {
.TrimStart('{')
// removes the trailing }
.TrimEnd('}')
// splits on the different token in the middle
.Split( new string[] { "},{" }, StringSplitOptions.None );
}
public static void Main()
{
var instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";
foreach (var item in ParseTokens( instance ) ) {
Console.WriteLine( item );
}
}
}

Add using System.Text.RegularExpressions; to the top of the file
and use the Regex.Split method:
string[] tokens = Regex.Split(input, "(?<=}),");
Here, we use a positive lookbehind to split on a , that comes immediately after a }.
(Note: (?<=X) matches only at positions that are directly preceded by X, without consuming X.)
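Splitting this way leaves the braces on each token, so a further step is needed to reach the output the question asks for. A small sketch, assuming every group is wrapped in exactly one pair of braces; the class and method names are made up for illustration.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindSplit
{
    // Hypothetical helper: split on commas that follow '}', then strip the braces.
    public static string[] SplitGroups(string input)
    {
        return Regex.Split(input, @"(?<=}),")
            .Select(token => token.Trim('{', '}'))
            .ToArray();
    }
}
```

Note that Trim removes every leading and trailing brace, so this would also eat braces that belong to the data at the very start or end of a group.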

If you don't want to use regular expressions, the following code will produce your required output.
string instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";
string[] tokens = instance.Replace("},{", "}{").Split('}', '{');
foreach (string item in tokens)
{
if (string.IsNullOrWhiteSpace(item)) continue;
Console.WriteLine(item);
}
Console.ReadLine();

Related

How to get parentheses inside parentheses

I'm trying to keep a parenthesis within a string that's surrounded by parentheses.
The string in question is: test (blue,(hmmm) derp)
The desired output into an array is: test and (blue,(hmmm) derp).
The current output is: (blue,, (hmm) and derp).
My current code is:
var input = Regex
.Split(line, @"(\([^()]*\))")
.Where(s => !string.IsNullOrEmpty(s))
.ToList();
How can I extract the text inside the outer parentheses (keeping them) and keep the inner parentheses as one string in an array?
EDIT:
To clarify my question, I want to ignore the inner parentheses and only split on the outer parentheses.
herpdediderp (orange,(hmm)) some other crap (red,hmm)
Should become:
herpdediderp, orange,(hmm), some other crap and red,hmm.
The code works for everything except the double parentheses: (orange,(hmm)) to orange,(hmm).

You can use the method
public string Trim(params char[] trimChars)
Like this
string trimmedLine = line.Trim('(', ')'); // Specify undesired leading and trailing chars.
// Specify separator characters for the split (here comma and space):
string[] input = trimmedLine.Split(new[]{',', ' '}, StringSplitOptions.RemoveEmptyEntries);
If the line can start or end with two consecutive parentheses, simply use good old if-statements:
if (line.StartsWith("(")) {
line = line.Substring(1);
}
if (line.EndsWith(")")) {
line = line.Substring(0, line.Length - 1);
}
string[] input = line.Split(new[]{',', ' '}, StringSplitOptions.RemoveEmptyEntries);

Lot's o' guessing going on here - from me and the others. You could try
[^(]+|\([^(]*(?:\([^(]*\)[^(]*)*\)
It handles one level of parentheses recursion (could be extended though).
Here at regexstorm.
Visual illustration at regex101.
If this piques your interest, I'll add an explanation ;)
Edit:
If you need to use split, put the selection in to a group, like
([^(]+|\([^(]*(?:\([^(]*\)[^(]*)*\))
and filter out empty strings. See example here at ideone.
Edit 2:
Not quite sure what behaviour you want with multiple levels of parentheses, but I assume this could do it for you:
([^(]+|\([^(]*(?:\([^(]*(?:\([^(]*\)[^(]*)*\)[^(]*)*\))
                        ^^^^^^^^^^^^^^^^^^^ added
For each level of recursion you want, you "just" add another inner level. So this is for two levels of recursion ;)
See it here at ideone.

Hopefully someone will come up with a regex. Here's my code answer.
static class ExtensionMethods
{
static public IEnumerable<string> GetStuffInsideParentheses(this IEnumerable<char> input)
{
int levels = 0;
var current = new Queue<char>();
foreach (char c in input)
{
if (levels == 0)
{
if (c == '(') levels++;
continue;
}
if (c == ')')
{
levels--;
if (levels == 0)
{
yield return new string(current.ToArray());
current.Clear();
continue;
}
}
if (c == '(')
{
levels++;
}
current.Enqueue(c);
}
}
}
Test program:
public class Program
{
public static void Main()
{
var input = new []
{
"(blue,(hmmm) derp)",
"herpdediderp (orange,(hmm)) some other crap (red,hmm)"
};
foreach ( var s in input )
{
var output = s.GetStuffInsideParentheses();
foreach ( var o in output )
{
Console.WriteLine(o);
}
Console.WriteLine();
}
}
}
Output:
blue,(hmmm) derp
orange,(hmm)
red,hmm
Code on DotNetFiddle

I think if you think about the problem backwards, it becomes a bit easier: don't split on what you don't want, extract what you do want.
The only slightly tricky part is matching nested parentheses; I assume you will only go one level deep.
The first example:
var s1 = "(blue, (hmmm) derp)";
var input = Regex.Matches(s1, @"\((?:\(.+?\)|[^()]+)+\)").Cast<Match>().Select(m => Regex.Matches(m.Value, @"\(\w+\)|\w+").Cast<Match>().Select(m2 => m2.Value).ToArray()).ToArray();
// input is string[][] { string[] { "blue", "(hmmm)", "derp" } }
The second example uses an extension method:
public static string TrimOutside(this string src, string openDelims, string closeDelims) {
if (!String.IsNullOrEmpty(src)) {
var openIndex = openDelims.IndexOf(src[0]);
if (openIndex >= 0 && src.EndsWith(closeDelims.Substring(openIndex, 1)))
src = src.Substring(1, src.Length - 2);
}
return src;
}
The code/patterns are different because the two examples are being handled differently:
var s2 = "herpdediderp (orange,(hmm)) some other crap (red,hmm)";
var input2 = Regex.Matches(s2, @"\w(?:\w| )+\w|\((?:[^(]+|\([^)]+\))+\)").Cast<Match>().Select(m => m.Value.TrimOutside("(",")")).ToArray();
// input2 is string[] { "herpdediderp", "orange,(hmm)", "some other crap", "red,hmm" }

Find repeated occurrences in String

I'm currently trying to find all matches to a rule in a string and copy those to a vector. The purpose is to build an application which retrieves the top N .mp3 files (podcasts) from a community website.
My current tactic:
public static string getBetween(string strSource, string strStart, string strEnd)
{
int Start, End;
if (strSource.Contains(strStart) && strSource.Contains(strEnd))
{
Start = strSource.IndexOf(strStart, 0) + strStart.Length;
End = strSource.IndexOf(strEnd, Start);
string sFound = strSource.Substring(Start, End + 4 - Start);
strSource = strSource.Remove(Start, End + 4 - Start);
return sFound;
}
else
{
return "";
}
}
Called like this:
for (int i = 0; i < N; i++)
{
Podcast.Add(getBetween(searchDoc(@TARGET_HTM), "Sound/", ".mp3"));
}
Where searchDoc is:
public static string searchDoc(string strFile)
{
StreamReader sr = new StreamReader(strFile);
String line = sr.ReadToEnd();
return line;
}
Why am I posting such a big chunk of code?
This is my first application in C#. I assume my current tactic is flawed and I'd rather see a solution for the underlying problem than a cheap fix for lousy code. Feel free to do whatever you feel like though.
What it should do:
Find all occurrences of "Sound/" + * + ".mp3" (all MP3 files in the directory Sound, whatever their name), from the top of the target .htm file until N are found. Do so by returning the top occurrence and removing it from the string.
What it does:
It finds the first occurrence just fine. It also removes the occurrence just fine. However, it only does so from strSource which gets discarded at the end of the function.
Problem:
How do I return the modified string in a safe manner (no global variables or other improper tricks), so the found occurrence is properly removed and the next can be found?

This is the wrong approach. You can use Regex.Matches to get all matches of the pattern that you want. The regex would be something like @"Sound/[^/""]+\.mp3".
Once you have a list of matches you can apply .Cast<Match>().Take(3).Select(m => m.Value) to it to get the top 3 matches as strings.
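Put together, that could look like the sketch below. The class and method names and the count parameter are assumptions for illustration; the pattern is written as a verbatim string so that \. reaches the regex engine unchanged.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PodcastFinder
{
    // Hypothetical helper: returns the first `count` "Sound/....mp3" paths found in the HTML text.
    public static List<string> FindTopMp3Links(string html, int count)
    {
        return Regex.Matches(html, @"Sound/[^/""]+\.mp3")
            .Cast<Match>()          // MatchCollection is non-generic on older frameworks
            .Take(count)
            .Select(m => m.Value)
            .ToList();
    }
}
```

Because the matches are collected in one pass over the text, there is no need to remove anything from the source string between searches.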
It looks like you have a C++ background. This can lead to low-level designs out of habit. Try to avoid manual string parsing and loops.

Three flaws:
First, these two things seem to belong together strongly, but you split them over two functions.
Second, you forgot to use the startIndex parameter of IndexOf, requiring you to rebuild strings that are later discarded (this is a performance hit!)
Third, you had a small error: you hardcoded the length of strEnd as 4.
I just made an extension method based on your code, which fixes these 3 flaws. Untested, since I have no VS on this computer.
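// Extension methods must be declared inside a static class.
public static List<string> Split(this string source, string start, string end) {
List<string> result = new List<string>();
int i = 0;
while (true) {
int found = source.IndexOf(start, i);
if (found == -1) break; // no further start delimiter
int startIndex = found + start.Length;
int endIndex = source.IndexOf(end, startIndex);
if (endIndex == -1) break; // start delimiter without a matching end
// Like the original code, the result keeps the end delimiter (e.g. ".mp3").
result.Add(source.Substring(startIndex, endIndex + end.Length - startIndex));
i = endIndex + end.Length;
}
return result;
}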

C#: Getting Substring between two different Delimiters

I have problems splitting this line. I want to get each string between "#VAR;" and "#ENDVAR;". So at the end, there should be an output of:
Variable=Speed;Value=Fast;
Variable=Fabricator;Value=Freescale;Op==;
Later I will separate each substring using ";" as a delimiter, but I guess that won't be that hard. This is what a line looks like:
#VAR;Variable=Speed;Value=Fast;Op==;#ENDVAR;#VAR;Variable=Fabricator;Value=Freescale;Op==;#ENDVAR;
I tried some split options, but most of the time I just get an empty string. I also tried a Regex, but either the Regex was wrong or it wasn't suitable for my string. Probably it's wrong; at school we learnt a different regex syntax than the one used in C#, so I was confused while implementing it.
Regex.Match(t, @"/#VAR([a-z=a-z]*)/#ENDVAR")
Edit:
One small question: I am iterating over many lines like the one in the question. I use NoIdea's code on each line to get it into shape. The next step would be to write it to a text file. To print an array I would have to loop over it, but in every iteration, when I get a new line, I overwrite the array with the current split string. I put the rest of my code in the question; it would be nice if someone could help me.
string[] w ;
foreach (EA.Element theObjects in myPackageObject.Elements)
{
theObjects.Type = "Object";
foreach (EA.Element theElements in PackageHW.Elements)
{
if (theObjects.ClassfierID == theElements.ElementID)
{
t = theObjects.RunState;
w = t.Replace("#ENDVAR;", "#VAR;").Replace("#VAR;", ";").Split(new string[] { ";" }, StringSplitOptions.RemoveEmptyEntries);
foreach (string s in w)
{
tw2.WriteLine(s);
}
}
}
}
I'm pretty sure the part with the foreach loop is wrong. I need something that prints each split t. Thanks in advance.

You can do it without regex using:
str.Replace("#ENDVAR;", "#VAR;")
.Split(new string[] { "#VAR;" }, StringSplitOptions.RemoveEmptyEntries);
And if you want to save time, you can do:
str.Replace("#ENDVAR;", "#VAR;")
.Replace("#VAR;", ";")
.Split(new string[] { ";" }, StringSplitOptions.RemoveEmptyEntries);

You can use a look ahead assertion here.
#VAR;(.*?)(?=#ENDVAR)
If your string never contains whitespace between #VAR; and #ENDVAR;, you could use the pattern below; it will not match empty instances.
#VAR;([^\s]+)(?=#ENDVAR)
See this demo

Answer using raw string manipulation.
IEnumerable<string> StuffFoundInside(string biggerString)
{
const string open = "#VAR;";
const string close = "#ENDVAR;";
int startingIndex = 0; // where the next search begins
while (true)
{
int openDelimiterIndex = biggerString.IndexOf(open, startingIndex);
if (openDelimiterIndex == -1) yield break;
int contentStart = openDelimiterIndex + open.Length; // skip the delimiter itself
int closeDelimiterIndex = biggerString.IndexOf(close, contentStart);
if (closeDelimiterIndex == -1) yield break;
yield return biggerString.Substring(contentStart, closeDelimiterIndex - contentStart);
startingIndex = closeDelimiterIndex + close.Length; // continue after this block
}
}
Making a list and adding each item to the list then returning the list might be faster, depending on how the code using this code would work. This allows it to terminate early, but has the coroutine overhead.

Use this regex:
(?i)#VAR;(.+?)#ENDVAR;
Group 1 in each match will be your line content.
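As a sketch of how group 1 could be collected in C# (the class and method names are illustrative, not from the answer):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class VarBlocks
{
    // Hypothetical helper: returns the text between each #VAR; and the next #ENDVAR;.
    public static List<string> Extract(string line)
    {
        return Regex.Matches(line, @"(?i)#VAR;(.+?)#ENDVAR;")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value) // group 1 is the lazy (.+?) capture
            .ToList();
    }
}
```

The lazy quantifier .+? stops at the first #ENDVAR; it meets, which keeps neighbouring blocks from being merged into one match.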

(If you don't like regexs)
Code:
var s = "#VAR;Variable=Speed;Value=Fast;Op==;#ENDVAR;#VAR;Variable=Fabricator;Value=Freescale;Op==;#ENDVAR;";
var tokens = s.Split(new String [] {"#ENDVAR;#VAR;"}, StringSplitOptions.None);
foreach (var t in tokens)
{
var st = t.Replace("#VAR;", "").Replace("#ENDVAR;", "");
Console.WriteLine(st);
}
Output:
Variable=Speed;Value=Fast;Op==;
Variable=Fabricator;Value=Freescale;Op==;

Regex.Split works well but yields empty entries that have to be removed as shown here:
string[] result = Regex.Split(input, @"#\w+;")
.Where(s => s != "")
.ToArray();

I tried some split-options, but most of the time I just get an empty string.
In this case the requirements seem to be simpler than you're stating. Simply splitting and using linq will do your whole operation in one statement:
string test = "#VAR;Variable=Speed;Value=Fast;Op==;#ENDVAR;#VAR;Variable=Fabricator;Value=Freescale;Op==;#ENDVAR;";
List<List<string>> strings = (from s in test.Split(new string[]{"#VAR;",";#ENDVAR;"},StringSplitOptions.RemoveEmptyEntries)
let s1 = s.Split(new char[]{';'},StringSplitOptions.RemoveEmptyEntries).ToList<string>()
select (s1)).ToList<List<string>>();
The output is:
?strings[0]
Count = 3
[0]: "Variable=Speed"
[1]: "Value=Fast"
[2]: "Op=="
?strings[1]
Count = 3
[0]: "Variable=Fabricator"
[1]: "Value=Freescale"
[2]: "Op=="
To write the data to a file something like this will work:
foreach (List<string> s in strings)
{
System.IO.File.AppendAllLines("textfile1.txt", s);
}

How to properly split a CSV using C# split() function?

Suppose I have this CSV file :
NAME,ADDRESS,DATE
"Eko S. Wibowo", "Tamanan, Banguntapan, Bantul, DIY", "6/27/1979"
I would like to store each token that is enclosed in double quotes in an array. Is there a safe way to do this instead of using the String split() function? Currently I load the file into a RichTextBox, and then, using its Lines[] property, I loop over each Lines[] element and do this:
string[] line = s.Split(',');
s is a reference to RichTextBox.Lines[].
As you can clearly see, the comma inside a token easily messes up the split() function. So instead of ending up with three tokens as I want, I end up with six.
Any help will be appreciated!

You could use regex too:
string input = "\"Eko S. Wibowo\", \"Tamanan, Banguntapan, Bantul, DIY\", \"6/27/1979\"";
string pattern = @"""\s*,\s*""";
// input.Substring(1, input.Length - 2) removes the first and last " from the string
string[] tokens = System.Text.RegularExpressions.Regex.Split(
input.Substring(1, input.Length - 2), pattern);
This will give you:
Eko S. Wibowo
Tamanan, Banguntapan, Bantul, DIY
6/27/1979

I've done this with my own method. It simply counts the number of " and ' characters.
Improve this to your needs.
public List<string> SplitCsvLine(string s) {
int i;
int a = 0;
int count = 0;
List<string> str = new List<string>();
for (i = 0; i < s.Length; i++) {
switch (s[i]) {
case ',':
if ((count & 1) == 0) {
str.Add(s.Substring(a, i - a));
a = i + 1;
}
break;
case '"':
case '\'': count++; break;
}
}
str.Add(s.Substring(a));
return str;
}

It's not an exact answer to your question, but why don't you use an existing library to manipulate CSV files? A good example would be LinqToCsv. CSV can be delimited with various punctuation characters. Moreover, there are gotchas that library authors have already addressed, such as dealing with the header row, handling different date formats, and mapping rows to C# objects.

You can replace "," with ; then split by ;
var values= s.Replace("\",\"",";").Split(';');

If your CSV line is tightly packed, it's easiest to use the first-and-last-character removal mentioned earlier and then a simple split on the joining string:
string[] tokens = input.Substring(1, input.Length - 2).Split(new[] { "\",\"" }, StringSplitOptions.None);
This will only work if ALL fields are double-quoted even if they don't (officially) need to be. It will be faster than RegEx but with given conditions as to its use.
Really useful if your data looks like
"Name","1","12/03/2018","Add1,Add2,Add3","other stuff"

Five years old but there is always somebody new who wants to split a CSV.
If your data is simple and predictable (i.e. never has any special characters like commas, quotes and newlines) then you can do it with split() or regex.
But to support all the nuances of the CSV format properly without code soup you should really use a library where all the magic has already been figured out. Don't re-invent the wheel (unless you are doing it for fun of course).
CsvHelper is simple enough to use:
https://joshclose.github.io/CsvHelper/2.x/
using (var parser = new CsvParser(textReader))
{
while(true)
{
string[] line = parser.Read();
if (line != null)
{
// do something
}
else
{
break;
}
}
}
More discussion / same question:
Dealing with commas in a CSV file

Get the different substrings from one main string

I have the following main string, which contains link names and link URLs. Each name and URL is joined with #;. I want to get the string of each link (name and URL, i.e. My web#?http://www.google.com); see the example below.
string teststring = "My web#;http://www.google.com My Web2#;http://www.bing.se Handbooks#;http://www.books.se/";
and I want to get three different strings using any string function:
My web#?http://www.google.com
My Web2#?http://www.bing.se
Handbooks#?http://www.books.de

So this looks like you want to split on the space after a #;, instead of splitting at #; itself. C# provides arbitrary length lookbehinds, which makes that quite easy. In fact, you should probably do the replacement of #; with #? first:
string teststring = "My web#;http://www.google.com My Web2#;http://www.bing.se Handbooks#;http://www.books.se/";
teststring = Regex.Replace(teststring, @"#;", "#?");
string[] substrings = Regex.Split(teststring, @"(?<=#\?\S*)\s+");
That's it:
foreach(var s in substrings)
Console.WriteLine(s);
Output:
My web#?http://www.google.com
My Web2#?http://www.bing.se
Handbooks#?http://www.books.se/
If you are worried that your input might already contain other #? that you don't want to split on, you can of course do the splitting first (using #; in the pattern) and then loop over substrings and do the replacement call inside the loop.
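A sketch of that split-first variant, assuming the same input layout as in the question; LinkSplitter and SplitLinks are made-up names.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class LinkSplitter
{
    // Hypothetical helper: split on whitespace that follows "#;<non-space chars>",
    // then rewrite the separator inside each piece.
    public static string[] SplitLinks(string input)
    {
        return Regex.Split(input, @"(?<=#;\S*)\s+")
            .Select(part => part.Replace("#;", "#?"))
            .ToArray();
    }
}
```

Since the split keys on #; rather than #?, any #? already present in the input does not create extra split points.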

If these are constant strings, you can just use String.Substring. This will require you to count letters, which is a nuisance, in order to provide the right parameters, but it will work.
string string1 = teststring.Substring(0, 29).Replace(";","?");
If they aren't, things get complicated. You could almost do a split with " " as the delimiter, except that your site name has a space. Do any of the substrings in your data have constant features, such as domain endings (i.e. first .com, then .de, etc.) or something like that?

If you have any control on the input format, you may want to change it to be easy to parse, for example by using another separator between items, other than space.
If this format can't be changed, why not just implement the split in code? It's not as short as using a RegEx, but it might be actually easier for a reader to understand since the logic is straight forward.
This will almost definitely be faster and cheaper in terms of memory usage.
An example for code that solves this would be:
static void Main(string[] args)
{
var testString = "My web#;http://www.google.com My Web2#;http://www.bing.se Handbooks#;http://www.books.se/";
foreach(var x in SplitAndFormatUrls(testString))
{
Console.WriteLine(x);
}
}
private static IEnumerable<string> SplitAndFormatUrls(string input)
{
var length = input.Length;
var last = 0;
var seenSeparator = false;
var previousChar = ' ';
for (var index = 0; index < length; index++)
{
var currentChar = input[index];
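// A token ends at a space, or at the end of the input (where the final character still belongs to the token).
var atEnd = index == length - 1;
if ((currentChar == ' ' || atEnd) && seenSeparator)
{
var end = currentChar == ' ' ? index : index + 1;
var currentUrl = input.Substring(last, end - last);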
yield return currentUrl.Replace("#;", "#?");
last = index + 1;
seenSeparator = false;
previousChar = ' ';
continue;
}
if (currentChar == ';' && previousChar == '#')
{
seenSeparator = true;
}
previousChar = currentChar;
}
}
