How can I make nested string splits?

How can I make nested string splits? - c#

I have what seemed at first to be a trivial problem but turned out to become something I can't figure out how to easily solve. I need to be able to store lists of items in a string. Then those items in turn can be a list, or some other value that may contain my separator character. I have two different methods that unpack the two different cases but I realized I need to encode the contained value from any separator characters used with string.Split.
To illustrate the problem:
string[] nested = { "mary;john;carl", "dog;cat;fish", "plainValue" }
string list = string.Join(";", nested);
string[] unnested = list.Split(';'); // EEK! returns 7 items, expected 3!
This would produce a list "mary;john;carl;dog;cat;fish;plainValue", a value I can't split to get the three original nested strings from. Indeed, instead of the three original strings, I'd get 7 strings on split and this approach thus doesn't work at all.
What I want is to allow the values in my string to be encoded so I can unpack/split the contents just the way before I packed/join them. I assume I might need to go away from string.Split and string.Join and that is perfectly fine. I might just have overlooked some useful class or method.
How can I allow any string values to be packed / unpacked into lists?
I prefer neat, simple solutions over bulky if possible.
For the curious mind, I am making extensions for PlayerPrefs in Unity3D, and I can only work with ints, floats and strings. Thus I chose strings to be my data carrier. This is why I am making this nested list of strings.

try:
const char joinChar = '╗'; // make char const
string[] nested = { "mary;john;carl", "dog;cat;fish", "plainValue" };
string list = string.Join(Convert.ToString(joinChar), nested);
string[] unnested = list.Split(joinChar); // eureka returns 3!
using an ascii character outside the normal 'set' allows you to join and split without ruining your logic that is separated on the ; char.

Encode your strings with base64 encoding before joining.

The expected items are 7 because you're splitting with a ; char. I would suggest to change your code to:
string[] nested = { "mary;john;carl", "dog;cat;fish", "plainValue" }
string list = string.Join("#" nested);
string[] unnested = list.Split('#'); // 3 strings again

Have you considered using a different separator, eg "|"?
This way the joined string will be "mary;john;carl|dog;cat;fish|plainValue" and when you call list.split("|"); it will return the three original strings

Use some other value than semicolon (;) for joining. For example - you can use comma (,) and you will get "mary;john;carl,dog;cat;fish,plainValue". When you again split it based on (,) as a separator, you should get back your original string value.

I came up with a solution of my own as well.
I could encode the length of an item, followed with the contents of an item. It would not use string.Split and string.Join at all, but it would solve my problem. The content would be untouched, and any content that need encoding could in turn use this encoding in its content space.
To illustrate the format (constant length header):
< content length > < raw content >
To illustrate the format (variable length header):
< content length > < header stop character > < raw content >
In the former, a fixed length of characters are used to describe the length of the contents. This could be plain text, hexadecimal, base64 or some other encoding.
Example with 4 hexadecimals (ffff/65535 max length):
0005Hello0005World
In the latter example, we can reduce this to:
5:Hello5:World
Then I could look for the first occurance of : and parse the length first, to extract the substring that follows. After that is the next item of the list.
A nested example could look like:
e:5:Hello5:Worlda:2:Hi4:John
(List - 14 charactes including headers)
Hello (5 characters)
World (5 characters)
(List - 10 characters including headers)
Hi (2 characters)
John (4 characters)
A drawback is that it explicitly requires the length of all items, even if no "shared separator" character wouldn't been present (this solution use no separators if using fixed length header).

Maby not as nice as you wanted. But here goes :)
static void Main(string[] args)
{
string[] str = new string[] {"From;niklas;to;lasse", "another;day;at;work;", "Bobo;wants;candy"};
string compiledString = GetAsString(str);
string[] backAgain = BackToStringArray(compiledString);
}
public static string GetAsString(string[] strings)
{
string returnString = string.Empty;
using (MemoryStream ms = new MemoryStream())
{
using (BinaryWriter writer = new BinaryWriter(ms))
{
writer.Write(strings.Length);
for (int i = 0; i < strings.Length; ++i)
{
writer.Write(strings[i]);
}
}
ms.Flush();
byte[] array = ms.ToArray();
returnString = Encoding.UTF8.GetString(array);
}
return returnString;
}
public static string[] BackToStringArray(string encodedString)
{
string[] returnStrings = new string[0];
byte[] toBytes = Encoding.UTF8.GetBytes(encodedString);
using (MemoryStream stream = new MemoryStream(toBytes))
{
using (BinaryReader reader = new BinaryReader(stream))
{
int numStrings = reader.ReadInt32();
returnStrings = new string[numStrings];
for (int i = 0; i < numStrings; ++i)
{
returnStrings[i] = reader.ReadString();
}
}
}
return returnStrings;
}

Related

C# utf string conversion, characters which don't display correctly get converted to "unknown character" - how to prevent this?

I've got two strings which are derived from Windows filenames, which contain unicode characters that do not display correctly in Windows (they show just the square box "unknown character" instead of the correct character). However the filenames are valid and these files exist without problems in the operating system, which means I need to be able to deal with them correctly and accurately.
I'm loading the filenames the usual way:
string path = #"c:\folder";
foreach (FileInfo file in DirectoryInfo.EnumerateFiles(path))
{
string filename = file.FullName;
}
but for the purposes of explaining this problem, these are the two filenames I'm having issues with:
string filename1 = "\ude18.txt";
string filename2 = "\udca6.txt";
Two strings, two filenames with a single unicode character plus an extension, both different. This so far is fine, I can read and write these files no problem, however I need to store these strings in a sqlite db and later retrieve them. Every attempt I make to do so results in both of these characters being changed to the "unknown character", so the original data is lost and I can no longer differentiate the two strings. At first I thought this was an sqlite issue, and I've made sure my db is in UTF16, but it turns out it's the conversion in c# to UTF16 that is causing the problem.
If I ignore sqlite entirely, and simply try to manually convert these strings to UTF16 (or to any other encoding), these characters are converted to the "unknown character" and the original data is lost. If I do this:
System.Text.Encoding enc = System.Text.Encoding.Unicode;
string filename1 = "\ude18.txt";
string filename2 = "\udca6.txt";
byte[] name1Bytes = enc.GetBytes(filename1);
byte[] name2Bytes = enc.GetBytes(filename2);
and I then inspect the bytearrays 'name1Bytes' and 'name2Bytes' they are both identical. and I can see that the unicode character in both cases has been converted to a pair of bytes 253 and 255 - the unknown character. and sure enough when I convert back
string newFilename1 = enc.GetString(name1Bytes);
string newFilename2 = enc.GetString(name2Bytes);
the orignal unicode character in each case is lost, and replaced with a diamond question mark symbol. I have lost the original filenames altogether.
It seems that these encoding conversions rely on the system font being able to display the characters, and this is a problem as these strings already exist as filenames, and changing the filenames isn't an option. I need to preserve this data somehow when sending it to sqlite, and when it's sent to sqlite it will go through a conversion process to UTF16, and it's this conversion that I need it to survive without losing data.

If you cast a char to an int, you get the numeric value, bypassing the Unicode conversion mechanism:
foreach (char ch in filename1)
{
int i = ch; // 0x0000de18 == 56856 for the first char in filename1
... do whatever, e.g., create an int array, store it as base64
}
This turns out to work as well, and is perhaps more elegant:
foreach (int ch in filename1)
{
...
}
So perhaps something like this:
string Encode(string raw)
{
byte[] bytes = new byte[2 * raw.Length];
int i = 0;
foreach (int ch in raw)
{
bytes[i++] = (byte)(ch & 0xff);
bytes[i++] = (byte)(ch >> 8);
}
return Convert.ToBase64String(bytes);
}
string Decode(string encoded)
{
byte[] bytes = Convert.FromBase64String(encoded);
char[] chars = new char[bytes.Length / 2];
for (int i = 0; i < chars.Length; ++i)
{
chars[i] = (char)(bytes[i * 2] | (bytes[i * 2 + 1] << 8));
}
return new string(chars);
}

C# only use arrays if it exists

I am a beginner in programming. And now I'm facing a task where I can't get any further. Probably it is relatively easy to solve.
This is what I want to do: I read out a .txt file and there are several lines of content.
Example what is in the .txt file:
text1,text2,text3
text1,text2,
text1,text2,text3,text4
I'm now ready to find the right line and use it. Then I want to split the line and assign each text to its own string.
I can do this if I know that this line have 4 words. But what if I don't know how many words this line have.
For example if I want to assign 5 strings but there are only 4 arrays in the column I get an error.
My program currently looks like this:
string reader = "text1,text2,text3,text4";
string[] words = reader.Split(',');
string word1 = words[0].ToString();
string word2 = words[1].ToString();
string word3 = words[2].ToString();
string word4 = words[3].ToString();
textBox1.Text = word3;
My goal is to find out how many words are in the string. And then pass each word to a separate string.
Thank you in advance

To get the length of the Array, you can easily use .Length
In your example, you just write
int arraylength = words.Length;
I don't understand, why do you want to create a new String for every value of the string-array? You can just use them in the array.
In your example you always user .ToString(), this isn't necessary because you already have a string.
An array is just multiple variables (in your example strings) which are connected to another.

I doubt if you want separated local variables like word1, word2 etc. To see why, let's
bring the idea to the point of absurdity. Imagine, that we have a small narration with 1234
words only. Do we really want to create word1, word2, ..., word1234 local variables?
So, let's stick to a single words array only:
string[] words = reader.Split(',');
Now, you can easily get array Length (i.e. number of items):
textBoxCount.Text = $"We have {words.Length} words in total";
Or get N-th word (let N be one based) from the words array:
string wordN = array.Length >= N ? array[N - 1] : "SomeDefaultValue";
In your case (3d word) it can be
// either 3d word or an empty string (when we have just two less words)
textBox1.Text = array.Length >= 3 ? array[3 - 1] : "";
Technically, you can use Linq and query the reader string:
using System.Linq;
...
// 3d word or empty string
textBox1.Text = reader.Split(',').ElementAtOrDefault(3 - 1) ?? "";
But Linq seems to be overshot here.

Find a delimiter of csv or text files in c#

I want to find a delimiter being used to separate the columns in csv or text files.
I am using TextFieldParser class to read those files.
Below is my code,
String path = #"c:\abc.csv";
DataTable dt = new DataTable();
if (File.Exists(path))
{
using (Microsoft.VisualBasic.FileIO.TextFieldParser parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(path))
{
parser.TextFieldType = FieldType.Delimited;
if (path.Contains(".txt"))
{
parser.SetDelimiters("|");
}
else
{
parser.SetDelimiters(",");
}
parser.HasFieldsEnclosedInQuotes = true;
bool firstLine = true;
while (!parser.EndOfData)
{
string[] fields = parser.ReadFields();
if (firstLine)
{
foreach (var val in fields)
{
dt.Columns.Add(val);
}
firstLine = false;
continue;
}
dt.Rows.Add(fields);
}
}
lblCount.Text = "Count of total rows in the file: " + dt.Rows.Count.ToString();
dgvTextFieldParser1.DataSource = dt;
Instead of passing the delimiters manually based on the file type, I want to read the delimiter from the file and then pass it.
How can I do that?

Mathematically correct but totally useless answer: It's not possible.
Pragmatical answer: It's possible but it depends on how much you know about the file's structure. It boils down to a bunch of assumptions and depending on which we'll make, the answer will vary. And if you can't make any assumptions, well... see the mathematically correct answer.
For instance, can we assume that the delimiter is one or any of the elements in the set below?
List<char> delimiters = new List<char>{' ', ';', '|'};
Or can we assume that the delimiter is such that it produces elements of equal length?
Should we try to find a delimiter that's a single character or can a word be one?
Etc.
Based on the question, I'll assume that it's the first option and that we have a limited set of possible characters, precisely one of which is be a delimiter for a given file.
How about you count the number of occurrences of each such character and assume that the one that's occurring most frequently is the one? Is that sufficiently rigid or do you need to be more sure than that?
List<char> delimiters = new List<char>{' ', ';', '-'};
Dictionary<char, int> counts = delimiters.ToDictionary(key => key, value => 0);
foreach(char c in delimiters)
counts[c] = textArray.Count(t => t == c);
I'm not in front of a computer so I can't verify but the last step would be returning the key from the dictionary the value of which is the maximal.
You'll need to take into consideration a special case such that there's no delimiters detected, there are equally many delimiters of two types etc.

Very simple guessing approach using LINQ:
static class CsvSeperatorDetector
{
private static readonly char[] SeparatorChars = {';', '|', '\t', ','};
public static char DetectSeparator(string csvFilePath)
{
string[] lines = File.ReadAllLines(csvFilePath);
return DetectSeparator(lines);
}
public static char DetectSeparator(string[] lines)
{
var q = SeparatorChars.Select(sep => new
{Separator = sep, Found = lines.GroupBy(line => line.Count(ch => ch == sep))})
.OrderByDescending(res => res.Found.Count(grp => grp.Key > 0))
.ThenBy(res => res.Found.Count())
.First();
return q.Separator;
}
}
What this does is it reads the file line by line (note that CSV files may include line breaks), then checks for each potential separator how often it occurs in each line.
Then we check which separator occurs on the most lines, and of those which occur on the same number of lines, we take the one with the most even distribution (e.g. 5 occurences on every line are ranked higher than one that occurs once in one line and 10 times in another line).
Of course you might have to tweak this for your own purposes, add error handling, fallback logic and so forth. I'm sure it's not perfect, but it's good enough for me.

You could probably take n bytes from the file, count possible delimiter characters(or all characters found) using a hash map/dictionary, and then the character repeated most is probably the delimiter you're looking for. It would make sense to me that the characters used as delimiters would be the ones used the most. When done you reset the stream, but since you're using a text reader you would have to probably initialize another text reader or something. This would get slightly more hairy if the CSV used more than one delimiter. You would probably have to ignore some characters like alpha and numeric.

In python we can do this easily by using csv sniffer. It will cater for text files and also if you just need to read some bytes from the file.

How to properly split a CSV using C# split() function?

Suppose I have this CSV file :
NAME,ADDRESS,DATE
"Eko S. Wibowo", "Tamanan, Banguntapan, Bantul, DIY", "6/27/1979"
I would like like to store each token that enclosed using a double quotes to be in an array, is there a safe to do this instead of using the String split() function? Currently I load up the file in a RichTextBox, and then using its Lines[] property, I do a loop for each Lines[] element and doing this :
string[] line = s.Split(',');
s is a reference to RichTextBox.Lines[].
And as you can clearly see, the comma inside a token can easily messed up split() function. So, instead of ended with three token as I want it, I ended with 6 tokens
Any help will be appreciated!

You could use regex too:
string input = "\"Eko S. Wibowo\", \"Tamanan, Banguntapan, Bantul, DIY\", \"6/27/1979\"";
string pattern = #"""\s*,\s*""";
// input.Substring(1, input.Length - 2) removes the first and last " from the string
string[] tokens = System.Text.RegularExpressions.Regex.Split(
input.Substring(1, input.Length - 2), pattern);
This will give you:
Eko S. Wibowo
Tamanan, Banguntapan, Bantul, DIY
6/27/1979

I've done this with my own method. It simply counts the amout of " and ' characters.
Improve this to your needs.
public List<string> SplitCsvLine(string s) {
int i;
int a = 0;
int count = 0;
List<string> str = new List<string>();
for (i = 0; i < s.Length; i++) {
switch (s[i]) {
case ',':
if ((count & 1) == 0) {
str.Add(s.Substring(a, i - a));
a = i + 1;
}
break;
case '"':
case '\'': count++; break;
}
}
str.Add(s.Substring(a));
return str;
}

It's not an exact answer to your question, but why don't you use already written library to manipulate CSV file, good example would be LinqToCsv. CSV could be delimited with various punctuation signs. Moreover, there are gotchas, which are already addressed by library creators. Such as dealing with name row, dealing with different date formats and mapping rows to C# objects.

You can replace "," with ; then split by ;
var values= s.Replace("\",\"",";").Split(';');

If your CSV line is tightly packed it's easiest to use the end and tail removal mentioned earlier and then a simple split on a joining string
string[] tokens = input.Substring(1, input.Length - 2).Split("\",\"");
This will only work if ALL fields are double-quoted even if they don't (officially) need to be. It will be faster than RegEx but with given conditions as to its use.
Really useful if your data looks like
"Name","1","12/03/2018","Add1,Add2,Add3","other stuff"

Five years old but there is always somebody new who wants to split a CSV.
If your data is simple and predictable (i.e. never has any special characters like commas, quotes and newlines) then you can do it with split() or regex.
But to support all the nuances of the CSV format properly without code soup you should really use a library where all the magic has already been figured out. Don't re-invent the wheel (unless you are doing it for fun of course).
CsvHelper is simple enough to use:
https://joshclose.github.io/CsvHelper/2.x/
using (var parser = new CsvParser(textReader)
{
while(true)
{
string[] line = parser.Read();
if (line != null)
{
// do something
}
else
{
break;
}
}
}
More discussion / same question:
Dealing with commas in a CSV file

C# read txt file and store the data in formatted array

I have a text file which contains following
Name address phone salary
Jack Boston 923-433-666 10000
all the fields are delimited by the spaces.
I am trying to write a C# program, this program should read a this text file and then store it in the formatted array.
My Array is as follows:
address
salary
When ever I am trying to look in google I get is how to read and write a text file in C#.
Thank you very much for your time.

You can use File.ReadAllLines method to load the file into an array. You can then use a for loop to loop through the lines, and the string type's Split method to separate each line into another array, and store the values in your formatted array.
Something like:
static void Main(string[] args)
{
var lines = File.ReadAllLines("filename.txt");
for (int i = 0; i < lines.Length; i++)
{
var fields = lines[i].Split(' ');
}
}

Do not reinvent the wheel. Can use for example fast csv reader where you can specify a delimeter you need.
There are plenty others on internet like that, just search and pick that one which fits your needs.

This answer assumes you don't know how much whitespace is between each string in a given line.
// Method to split a line into a string array separated by whitespace
private string[] Splitter(string input)
{
return Regex.Split(intput, #"\W+");
}
// Another code snippet to read the file and break the lines into arrays
// of strings and store the arrays in a list.
List<String[]> arrayList = new List<String[]>();
using (FileStream fStream = File.OpenRead(#"C:\SomeDirectory\SomeFile.txt"))
{
using(TextReader reader = new StreamReader(fStream))
{
string line = "";
while(!String.IsNullOrEmpty(line = reader.ReadLine()))
{
arrayList.Add(Splitter(line));
}
}
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How can I make nested string splits? - c#

Encode your strings with base64 encoding before joining.

The expected items are 7 because you're splitting with a ; char. I would suggest to change your code to: string[] nested = { "mary;john;carl", "dog;cat;fish", "plainValue" } string list = string.Join("#" nested); string[] unnested = list.Split('#'); // 3 strings again

Have you considered using a different separator, eg "|"? This way the joined string will be "mary;john;carl|dog;cat;fish|plainValue" and when you call list.split("|"); it will return the three original strings

Use some other value than semicolon (;) for joining. For example - you can use comma (,) and you will get "mary;john;carl,dog;cat;fish,plainValue". When you again split it based on (,) as a separator, you should get back your original string value.

Related

C# utf string conversion, characters which don't display correctly get converted to "unknown character" - how to prevent this?

C# only use arrays if it exists

Find a delimiter of csv or text files in c#

How to properly split a CSV using C# split() function?

C# read txt file and store the data in formatted array

Categories

Resources