Parsing CSV File enclosed with quotes in C#

I've seen lots of samples for parsing CSV files, but this one is an annoying kind of file.
So how do you parse this kind of CSV?
"1",1/2/2010,"The sample ("adasdad") asdada","I was pooping in the door "Stinky", so I'll be damn","AK"

The best answer in most cases is probably @Jim Mischel's. TextFieldParser seems to be exactly what you want for most conventional cases -- though it strangely lives in the Microsoft.VisualBasic namespace! But this case isn't conventional.
The last time I ran into a variation on this issue where I needed something unconventional, I embarrassingly gave up on regexp'ing and bullheaded a char by char check. Sometimes, that's not-wrong enough to do. Splitting a string isn't as difficult a problem if you byte push.
So I rewrote for this case as a string extension. I think this is close.
Do note that, "I was pooping in the door "Stinky", so I'll be damn", is an especially nasty case. Without the *** STINKY CONDITION *** code, below, you'd get I was pooping in the door "Stinky as one value and so I'll be damn" as the other.
The only way to do better than that for any anonymous weird splitter/escape case would be to have some sort of algorithm to determine the "usual" number of columns in each row, and then check for, in this case, fixed length fields like your AK state entry or some other possible landmark as a sort of normalizing backstop for nonconformist columns. But that's serious crazy logic that likely isn't called for, as much fun as it'd be to code. As @Vash points out, you're better off following some standard and coding a little more OFfensively.
But the problem here is probably easier than that. The only lexically meaningful case is the one in your example -- ", -- double quote, comma, and then a space. So that's what the *** STINKY CONDITION *** code checks. Even so, this code is getting nastier than I'd like, which means you have ever stranger edge cases, like "This is also stinky," a f a b","Now what?" Heck, even "A,"B","C" doesn't work in this code right now, iirc, since I treat the begin and end chars as having been escape pre- and post-fixed. So we're largely back to @Vash's comment!
Apologies for all the brackets for one-line if statements, but I'm stuck in a StyleCop world right now. I'm not necessarily suggesting you use this -- that strictEscapeToSplitEvaluation plus the STINKY CONDITION makes this a little complex. But it's worth keeping in mind that a normal csv parser that's intelligent about quotes is significantly more straightforward to the point of being tedious, but otherwise trivial.
namespace YourFavoriteNamespace
{
using System;
using System.Collections.Generic;
using System.Text;
public static class Extensions
{
public static Queue<string> SplitSeeingQuotes(this string valToSplit, char splittingChar = ',', char escapeChar = '"',
bool strictEscapeToSplitEvaluation = true, bool captureEndingNull = false)
{
Queue<string> qReturn = new Queue<string>();
StringBuilder stringBuilder = new StringBuilder();
bool bInEscapeVal = false;
for (int i = 0; i < valToSplit.Length; i++)
{
if (!bInEscapeVal)
{
// Escape values must come immediately after a split.
// abc,"b,ca",cab has an escaped comma.
// abc,b"ca,c"ab does not.
if (escapeChar == valToSplit[i] && (!strictEscapeToSplitEvaluation || (i == 0 || (i != 0 && splittingChar == valToSplit[i - 1]))))
{
bInEscapeVal = true; // not capturing escapeChar as part of value; easy enough to change if need be.
}
else if (splittingChar == valToSplit[i])
{
qReturn.Enqueue(stringBuilder.ToString());
stringBuilder = new StringBuilder();
}
else
{
stringBuilder.Append(valToSplit[i]);
}
}
else
{
// Can't use switch b/c we're comparing to a variable, I believe.
if (escapeChar == valToSplit[i])
{
// Repeated escape always reduces to one escape char in this logic.
// So if you wanted "I'm ""double quote"" crazy!" to come out with
// the double double quotes, you're toast.
if (i + 1 < valToSplit.Length && escapeChar == valToSplit[i + 1])
{
i++;
stringBuilder.Append(escapeChar);
}
else if (!strictEscapeToSplitEvaluation)
{
bInEscapeVal = false;
}
// *** STINKY CONDITION ***
// Kinda defense, since only `", ` really makes sense.
else if ('"' == escapeChar && i + 2 < valToSplit.Length &&
valToSplit[i + 1] == ',' && valToSplit[i + 2] == ' ')
{
i = i+2;
stringBuilder.Append("\", ");
}
// *** EO STINKY CONDITION ***
else if (i+1 == valToSplit.Length || (i + 1 < valToSplit.Length && valToSplit[i + 1] == splittingChar))
{
bInEscapeVal = false;
}
else
{
stringBuilder.Append(escapeChar);
}
}
else
{
stringBuilder.Append(valToSplit[i]);
}
}
}
// NOTE: The `captureEndingNull` flag is not tested.
// Catch null final entry? "abc,cab,bca," could be four entries, with the last an empty string.
if ((captureEndingNull && splittingChar == valToSplit[valToSplit.Length-1]) || (stringBuilder.Length > 0))
{
qReturn.Enqueue(stringBuilder.ToString());
}
return qReturn;
}
}
}
Probably worth mentioning that the "answer" you gave yourself doesn't have the "Stinky" problem in its sample string. ;^)
[Understanding that we're three years after you asked,] I will say that your example isn't as insane as folks here make out. I can see wanting to treat escape characters (in this case, ") as escape characters only when they're the first value after the splitting character or, after finding an opening escape, stopping only if you find the escape character before a splitter; in this case, the splitter is obviously ,.
If the row of your csv is abc,bc"a,ca"b, I would expect that to mean we've got three values: abc, bc"a, and ca"b.
Same deal in your "The sample ("adasdad") asdada" column -- quotes that don't begin and end a cell value aren't escape characters and don't necessarily need doubling to maintain meaning. So I added a strictEscapeToSplitEvaluation flag here.
Enjoy. ;^)

I very strongly recommend using TextFieldParser. Hand-coded parsers that use String.Split or regular expressions almost invariably mishandle things like quoted fields that have embedded quotes or embedded separators.
I would be surprised, though, if it handled your particular example. As others have said, that line is, at best, ambiguous.
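For reference, here is a minimal sketch of TextFieldParser on a well-formed line, that is, one where embedded quotes are doubled. The class and method names (TextFieldParserDemo, ParseLine) are made up for illustration; the parser members are the ones from Microsoft.VisualBasic.FileIO. The malformed line from the question is not expected to parse cleanly this way.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserDemo
{
    // Parses a single comma-delimited record whose quoted fields follow
    // the usual convention (embedded quotes are doubled).
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            return parser.ReadFields();
        }
    }
}
```

Calling ParseLine on "1",1/2/2010,"The sample (""adasdad"") asdada","AK" should give four fields with the surrounding quotes removed and the doubled quotes reduced to single ones.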

Split based on
",
I would use MyString.IndexOf("\",")
and then substring the parts. Other than that, I'm sure someone has written a CSV parser out there that can handle this :)

I found a way to parse this malformed CSV. I looked for a pattern and found one: I first replace "," (quote, comma, quote) with a placeholder character such as "¤" and then split on it.
from this:
"Annoying","CSV File","poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby","yeah!"
to this:
"Annoying¤CSV File¤poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby¤yeah!"
then split it:
ArrayA[0]: "Annoying //the stray quote will be removed by Replace("\"",""), same as in ArrayA[3]
ArrayA[1]: CSV File
ArrayA[2]: poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby
ArrayA[3]: yeah!"
After splitting, I replace the ", and ," sequences in ArrayA[2] with ¤ and then split it again
from this
ArrayA[2]: poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby
to this
ArrayA[2]: poop@mypants.com¤1999,01-20-2001¤oh,boy¤01-20-2001¤yeah baby
then split it again, which gives this
ArrayB[0]: poop@mypants.com
ArrayB[1]: 1999,01-20-2001
ArrayB[2]: oh,boy
ArrayB[3]: 01-20-2001
ArrayB[4]: yeah baby
And lastly, I split ArrayB[1] on , into ArrayC to separate the year from the date.
It's tedious but there's no other way to do it...

There is another open source library, Cinchoo ETL, that handles quoted strings fine. Here is sample code.
string csv = @"""1"",1/2/2010,""The sample(""adasdad"") asdada"",""I was pooping in the door ""Stinky"", so I'll be damn"",""AK""";
using (var r = ChoCSVReader.LoadText(csv)
.QuoteAllFields()
)
{
foreach (var rec in r)
Console.WriteLine(rec.Dump());
}
Output:
[Count: 5]
Key: Column1 [Type: Int64]
Value: 1
Key: Column2 [Type: DateTime]
Value: 1/2/2010 12:00:00 AM
Key: Column3 [Type: String]
Value: The sample(adasdad) asdada
Key: Column4 [Type: String]
Value: I was pooping in the door Stinky, so I'll be damn
Key: Column5 [Type: String]
Value: AK

You could split the string by ",". It is recommended that each cell value in the CSV file be enclosed in quotes, like "1","2","3".....

I don't see how you could if each line is different. This line is malformed CSV. Quotes contained within a value must be doubled as shown below. I can't even tell for sure where the values should be terminated.
"1",1/2/2010,"The sample (""adasdad"") asdada","I was pooping in the door ""Stinky"", so I'll be damn","AK"
Here's my code to parse a CSV file but I don't see how any code would know how to handle your line because it's malformed.
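The code this answer refers to is not reproduced here. As a stand-in, the following is a minimal sketch of a parser for one well-formed record, where a doubled quote inside a quoted field means a literal quote. The names CsvLine and Parse are hypothetical, and the sketch assumes a record never spans more than one line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLine
{
    // Splits one well-formed CSV record. Quoted fields may contain the
    // separator, and a doubled quote ("") inside them is a literal quote.
    public static List<string> Parse(string line, char separator = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote -> one literal quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"' && current.Length == 0)
            {
                inQuotes = true;         // opening quote at the start of a field
            }
            else if (c == separator)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());  // last field (may be empty)
        return fields;
    }
}
```

Fed the corrected line above, this yields five values, with the fourth being: I was pooping in the door "Stinky", so I'll be damn. Fed the original malformed line, it has no way to know where the values end, which is the point of this answer.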

You might want to give CsvReader a try. It will handle quoted strings fine, so you will just have to remove the leading and trailing quotes.
It will fail if your strings contain a comma. To avoid this, the quotes need to be doubled, as said in other answers.

As no (decent) .csv parser can parse non-csv-data correctly, the task isn't to parse the data, but to fix the file(s) (and then to parse the correct data).
To fix the data you need a list of bad rows (to be sent to the person responsible for the garbage for manual editing). To get such a list, you can
use Access with a correct import specification to import the file. You'll get a list of import failures.
write a script/program that opens the file via the OLEDB text driver.
Sample file:
"Id","Remark","DateDue"
1,"This is good",20110413
2,"This is ""good""",20110414
3,"This is ""good"","bad",and "ugly",,20110415
4,"This is ""good""" again,20110415
Sample SQL/Result:
SELECT * FROM [badcsv01.csv]
Id Remark DateDue
1 This is good 4/13/2011
2 This is "good" 4/14/2011
3 This is "good", NULL
4 This is "good" again 4/15/2011
SELECT * FROM [badcsv01.csv] WHERE DateDue Is Null
Id Remark DateDue
3 This is "good", NULL

First you will do it for the columns names:
DataTable pbResults = new DataTable();
OracleDataAdapter oda = new OracleDataAdapter(cmd);
oda.Fill(pbResults);
StringBuilder sb1 = new StringBuilder();
StringBuilder sb2 = new StringBuilder();
IEnumerable<string> columnNames = pbResults.Columns.Cast<DataColumn>().Select(column => column.ColumnName);
sb1.Append(string.Join("\"" + "," + "\"", columnNames));
sb2.Append("\"");
sb2.Append(sb1);
sb2.AppendLine("\"");
Second you will do it for each row:
foreach (DataRow row in pbResults.Rows)
{
IEnumerable<string> fields = row.ItemArray.Select(field => field.ToString());
sb2.Append("\"");
sb2.Append(string.Join("\"" + "," + "\"", fields));
sb2.AppendLine("\"");
}
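Note that wrapping every value in quotes only produces valid CSV if quotes inside the values are doubled as well. A minimal sketch of that step, assuming hypothetical helper names (CsvWriterSketch, Quote, ToLine) that are not part of the code above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvWriterSketch
{
    // Wraps a value in quotes and doubles any quotes it contains.
    public static string Quote(string field)
    {
        return "\"" + (field ?? string.Empty).Replace("\"", "\"\"") + "\"";
    }

    // Joins already-stringified values into one fully quoted CSV line.
    public static string ToLine(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(Quote));
    }
}
```

In the loops above, this would mean passing each column name or field through Quote before joining with a plain comma, instead of joining with the "," sequence.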

Related

How to avoid false separators in csv / XML

I've been trying to understand how XML and CSV parsing work, without actually writing any code yet. I might have to parse a .csv file in the ongoing project and I'd like to be ready. (I'll have to convert them to .ofx files)
I'm also aware there's probably a thousand XML and CSV parsers out there, so I'm more curious than I am worried. I intend on using the XMLReader that I believe Microsoft provides.
Let's say I have the following .csv file
02/02/2016 ; myfirstname ; mylastname ; somefield ; 321654 ; commentary ; blabla
Sometimes a field will be missing. Which means, for the sake of the example, that the lastname isn't mandatory, and somefield could be right after the first name.
My questions are :
How do I avoid the confusion between somefield and lastname?
I could count the total number of fields, but in my situation two are optional, if there is only one missing, I can't be sure which one it is.
How do I avoid false "tags"? I mean, if the user first comment includes a ;, how can I be sure it's a part of his comment and not the start of the following tag?
Again, I could count the remaining fields and find out where I am, but that excludes the optional fields problem.
My questions also apply to XML: what can I do if the user starts writing XML in his form? Whether I decide to export the form as .csv or .xml, there can be trouble.
Right now I'm on the assumption that the c# Xml reader/parser are awesome enough to deal with it ; and if they are, I'm really curious on the how.
Assuming the CSV/XML data has been exported properly, none of this will be a problem. Missing fields will be handled by repeated separators:
02/02/2016;myfirstname;;somefield
Semi-colons within a field will normally be handled by quoting:
02/02/2016;"myfirst;name";
Quotes are escaped within a string:
02/02/2016;"my""first""name";
With XML it's even less of an issue since the tags or attributes will all have names.
If your CSV data is NOT well-formed, then you have a much bigger problem, as it may be impossible to distinguish missing fields and non-quoted separators.
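To illustrate the repeated-separator point, here is a small sketch showing that a plain split keeps the empty entry, so every field stays at its expected position. The class name EmptyFieldDemo is made up, and the sketch deliberately ignores quoting.

```csharp
using System;

public static class EmptyFieldDemo
{
    public static string[] Fields(string line)
    {
        // StringSplitOptions.None keeps empty entries, so a missing value
        // still occupies its own slot in the result.
        return line.Split(new[] { ';' }, StringSplitOptions.None);
    }
}
```

For 02/02/2016;myfirstname;;somefield this gives four entries, with an empty string where the last name would be and somefield still in the fourth position.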
How do I avoid false "tags"? String values should be quoted if they can contain separator characters. If you create the CSV file yourself, quote all string values when writing and unquote them when reading.
How do I avoid the confusion between somefield and lastname? There is no general solution for this; each case must be handled individually. Can a general algorithm decide whether the first name or the last name is missing? No.
If you know which field(s) can be omitted, you can write "intelligent" handling for them.
Use XML and all of your problems will be solved.
First
How do I avoid the confusion between somefield and lastname?
There is no way to do this without changing the logic of the file. For example, when "mylastname" is empty you could write a "" value (an empty string), or leave the field empty like this: ;;
How do I avoid false "tags"? I mean, if the user first comment includes a ;, how can I be sure it's a part of his comment and not the start of the following tag?
It is simple; the file has to look like this:
; - separator of columns
"" - delimiter of columns
value;value;"value;;;;value";value
To split this only on the separator ; while ignoring separators inside "", the following code works (it is tested and compiles):
public static string[] SplitWithDelimeter(this string line, char separator, char checkSeparator, bool eraseCheckSeparator)
{
var separatorsIndexes = new List<int>();
var open = false;
for (var i = 0; i < line.Length; i++)
{
if (line[i] == checkSeparator)
{
open = !open;
}
if (!open && line[i] == separator )
{
separatorsIndexes.Add(i);
}
}
separatorsIndexes.Add(line.Length);
var result = new string[separatorsIndexes.Count];
var first = 0;
for (var j = 0; j < separatorsIndexes.Count; j++)
{
var tempLine = line.Substring(first, separatorsIndexes[j] - first);
result[j] = eraseCheckSeparator ? tempLine.Replace(checkSeparator, ' ').Trim() : tempLine;
first = separatorsIndexes[j] + 1;
}
return result;
}
Return would be:
value
value
"value;;;;value"
value

Find a delimiter of csv or text files in c#

I want to find a delimiter being used to separate the columns in csv or text files.
I am using TextFieldParser class to read those files.
Below is my code,
String path = @"c:\abc.csv";
DataTable dt = new DataTable();
if (File.Exists(path))
{
using (Microsoft.VisualBasic.FileIO.TextFieldParser parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(path))
{
parser.TextFieldType = FieldType.Delimited;
if (path.Contains(".txt"))
{
parser.SetDelimiters("|");
}
else
{
parser.SetDelimiters(",");
}
parser.HasFieldsEnclosedInQuotes = true;
bool firstLine = true;
while (!parser.EndOfData)
{
string[] fields = parser.ReadFields();
if (firstLine)
{
foreach (var val in fields)
{
dt.Columns.Add(val);
}
firstLine = false;
continue;
}
dt.Rows.Add(fields);
}
}
lblCount.Text = "Count of total rows in the file: " + dt.Rows.Count.ToString();
dgvTextFieldParser1.DataSource = dt;
}
Instead of passing the delimiters manually based on the file type, I want to read the delimiter from the file and then pass it.
How can I do that?
Mathematically correct but totally useless answer: It's not possible.
Pragmatical answer: It's possible but it depends on how much you know about the file's structure. It boils down to a bunch of assumptions and depending on which we'll make, the answer will vary. And if you can't make any assumptions, well... see the mathematically correct answer.
For instance, can we assume that the delimiter is one or any of the elements in the set below?
List<char> delimiters = new List<char>{' ', ';', '|'};
Or can we assume that the delimiter is such that it produces elements of equal length?
Should we try to find a delimiter that's a single character or can a word be one?
Etc.
Based on the question, I'll assume that it's the first option and that we have a limited set of possible characters, precisely one of which is the delimiter for a given file.
How about you count the number of occurrences of each such character and assume that the one that's occurring most frequently is the one? Is that sufficiently rigid or do you need to be more sure than that?
List<char> delimiters = new List<char>{' ', ';', '-'};
Dictionary<char, int> counts = delimiters.ToDictionary(key => key, value => 0);
foreach(char c in delimiters)
counts[c] = textArray.Count(t => t == c);
I'm not in front of a computer so I can't verify but the last step would be returning the key from the dictionary the value of which is the maximal.
You'll need to take into consideration a special case such that there's no delimiters detected, there are equally many delimiters of two types etc.
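The counting idea, including the two special cases just mentioned, could be sketched roughly as follows. The names DelimiterGuess and MostFrequent are hypothetical, and returning null for "no answer" is just one possible design choice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DelimiterGuess
{
    // Returns the candidate that occurs most often in the text, or null
    // when no candidate occurs at all or two candidates tie for first place.
    public static char? MostFrequent(string text, IEnumerable<char> candidates)
    {
        var counts = candidates
            .Distinct()
            .Select(c => new { Candidate = c, Count = text.Count(t => t == c) })
            .OrderByDescending(x => x.Count)
            .ToList();

        if (counts.Count == 0 || counts[0].Count == 0)
        {
            return null; // no delimiter detected
        }
        if (counts.Count > 1 && counts[1].Count == counts[0].Count)
        {
            return null; // ambiguous: equally many of two types
        }
        return counts[0].Candidate;
    }
}
```

This counts raw occurrences only, so a candidate that appears a lot inside the data itself (a space, for example) can still win; it is a guess, not a proof.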
Very simple guessing approach using LINQ:
static class CsvSeperatorDetector
{
private static readonly char[] SeparatorChars = {';', '|', '\t', ','};
public static char DetectSeparator(string csvFilePath)
{
string[] lines = File.ReadAllLines(csvFilePath);
return DetectSeparator(lines);
}
public static char DetectSeparator(string[] lines)
{
var q = SeparatorChars.Select(sep => new
{Separator = sep, Found = lines.GroupBy(line => line.Count(ch => ch == sep))})
.OrderByDescending(res => res.Found.Count(grp => grp.Key > 0))
.ThenBy(res => res.Found.Count())
.First();
return q.Separator;
}
}
What this does is it reads the file line by line (note that CSV files may include line breaks), then checks for each potential separator how often it occurs in each line.
Then we check which separator occurs on the most lines, and of those which occur on the same number of lines, we take the one with the most even distribution (e.g. 5 occurrences on every line are ranked higher than one that occurs once in one line and 10 times in another line).
Of course you might have to tweak this for your own purposes, add error handling, fallback logic and so forth. I'm sure it's not perfect, but it's good enough for me.
You could probably take n bytes from the file, count possible delimiter characters(or all characters found) using a hash map/dictionary, and then the character repeated most is probably the delimiter you're looking for. It would make sense to me that the characters used as delimiters would be the ones used the most. When done you reset the stream, but since you're using a text reader you would have to probably initialize another text reader or something. This would get slightly more hairy if the CSV used more than one delimiter. You would probably have to ignore some characters like alpha and numeric.
In python we can do this easily by using csv sniffer. It will cater for text files and also if you just need to read some bytes from the file.

Regex to remove this string in C#?

I am scripting Agent Jobs using SMO for SQL Server and the resulting script strings have a parameter and value that I want to remove from the final version I am storing. The portion of the script that I want to look at is the schedule being added to the job, where it includes a @schedule_uid parameter with a GUID associated with it. I'd like to remove this entirely from the script.
EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name='Job Name',
@enabled=1,
@freq_type=4,
@freq_interval=1,
@freq_subday_type=4,
@freq_subday_interval=10,
@freq_relative_interval=1,
@freq_recurrence_factor=0,
@active_start_date=20150119,
@active_end_date=99991231,
@active_start_time=0,
@active_end_time=235959,
@schedule_uid=N'a70709af-bce7-4c65-a4cd-7574acd31ca2'
The part that I want to replace is the following:
, \r\n\t\t@schedule_uid=N'a70709af-bce7-4c65-a4cd-7574acd31ca2'
So that the final string is:
EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name='Job Name',
@enabled=1,
@freq_type=4,
@freq_interval=1,
@freq_subday_type=4,
@freq_subday_interval=10,
@freq_relative_interval=1,
@freq_recurrence_factor=0,
@active_start_date=20150119,
@active_end_date=99991231,
@active_start_time=0,
@active_end_time=235959
I've tried various combinations of things I've been reading online but I can't seem to make it replace or even match. I know that the regex for the guid matching is:
\b[A-F0-9]{8}(?:-[A-F0-9]{4}){3}-[A-F0-9]{12}\b'
I've tried to add this into a number of things, and thought that the following regex would work but can't figure out what I'm doing wrong or missing
@", \r\n\t\t@schedule_uid=N'\b[A-F0-9]{8}(?:-[A-F0-9]{4}){3}-[A-F0-9]{12}\b'"
@", \r\n\t\t@schedule_uid=N'[A-F0-9]{8}(?:-[A-F0-9]{4}){3}-[A-F0-9]{12}'"
@", \r\n\t\t\b@schedule_uid=N'[A-F0-9]{8}(?:-[A-F0-9]{4}){3}-[A-F0-9]{12}'\b"
I'm not looking for a solution as much as I'd like to know what I'm missing. I've been reading the regular-expressions.info site for a while and I'm usually able to figure out the correct regex, but this has had me stumped for a few days now.
EDIT:
It's not always the last item and it's not guaranteed to only occur once within the script since a job can have multiple schedules which have different @schedule_uid values and I want to get rid of all of them without looping. This is why I chose Regex for the operation. It also needs to remove the comma at the end of the previous parameter's line for the code to remain syntactically correct.
The following seems to work for me and it will enable you to remove all the newlines, tabs etc:
(?:\n|\t|\r|.){1,3}.*\@sc.*'
You can see it working here
There you go:
@schedule_uid=N'[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}'
Created and tested using http://regexpal.com/
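One thing that may explain why the attempts in the question never match: the GUID in the script is lowercase, while [A-F0-9] only matches uppercase hex digits unless the match is case-insensitive. The attempts also require the exact sequence comma, space, CR, LF, tab, tab in front of the parameter. A minimal sketch that relaxes both, assuming the parameter is spelled @schedule_uid as in T-SQL and is always preceded by a comma (ScheduleUidStripper and Strip are made-up names):

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScheduleUidStripper
{
    // Matches the preceding comma, any whitespace or line breaks after it,
    // and the @schedule_uid parameter with its GUID value.
    private static readonly Regex ScheduleUid = new Regex(
        @",\s*@schedule_uid=N'[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}'",
        RegexOptions.IgnoreCase);

    // Removes every occurrence in one call, without an explicit loop.
    public static string Strip(string script)
    {
        return ScheduleUid.Replace(script, string.Empty);
    }
}
```

Because the comma in front of the parameter is removed and any comma after it is left alone, the remaining parameter list stays valid whether the parameter was last or in the middle. A @schedule_uid placed directly after the procedure name, with no comma before it, would not be matched by this sketch.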
Assuming as little as possible, just using basic string operations.
string exec = ...
int i = exec.IndexOf("@schedule_uid");
while (i > -1)
{
int j = i;
//Find the previous comma
while (exec[i] != ',')
i--;
//Find the end, next line, or next comma
while (j < exec.Length && exec[j] != '\r' && exec[j] != ',')
j++;
exec = exec.Remove(i, j - i);
i = exec.IndexOf("@schedule_uid");
}
I'm deliberately ignoring the no looping requirement, in favour of simple code that works. tested vs this...
string exec = @"
EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule, @schedule_uid=N'a70709af-bce7-4c65-a4cd-7574acd31ca2', @job_id=@jobId, @name='Job Name',
@enabled=1,
@freq_type=4,
@freq_interval=1,
@freq_subday_type=4,
@freq_subday_interval=10,
@freq_relative_interval=1,
@freq_recurrence_factor=0,
@schedule_uid=N'a70709af-bce7-4c65-a4cd-7574acd31ca2',
@active_start_date=20150119,
@active_end_date=99991231,
@active_start_time=0,
@active_end_time=235959,
@schedule_uid=N'a70709af-bce7-4c65-a4cd-7574acd31ca2'";
A little more complicated, but works.
string test = "EXEC...";
var lines = test.Split(new char [] { ',' }).ToList();
lines = lines.Select((line, index) =>
{
var indexof = line.IndexOf("@schedule_uid");
if (indexof > -1)
{
if (index == 0)
{
return line.Substring(0, indexof);
}
else
{
return null;
}
}
return line + ",";
})
.Where(line => line != null)
.ToList();
test = string.Join(string.Empty, lines);
JsFiddle Example.

How to properly split a CSV using C# split() function?

Suppose I have this CSV file :
NAME,ADDRESS,DATE
"Eko S. Wibowo", "Tamanan, Banguntapan, Bantul, DIY", "6/27/1979"
I would like to store each token that is enclosed in double quotes in an array. Is there a safe way to do this instead of using the String split() function? Currently I load the file into a RichTextBox and then, using its Lines[] property, I loop over each Lines[] element and do this:
string[] line = s.Split(',');
s is a reference to RichTextBox.Lines[].
And as you can clearly see, the comma inside a token can easily mess up the split() function. So, instead of ending up with three tokens as I want, I end up with 6 tokens.
Any help will be appreciated!
You could use regex too:
string input = "\"Eko S. Wibowo\", \"Tamanan, Banguntapan, Bantul, DIY\", \"6/27/1979\"";
string pattern = @"""\s*,\s*""";
// input.Substring(1, input.Length - 2) removes the first and last " from the string
string[] tokens = System.Text.RegularExpressions.Regex.Split(
input.Substring(1, input.Length - 2), pattern);
This will give you:
Eko S. Wibowo
Tamanan, Banguntapan, Bantul, DIY
6/27/1979
I've done this with my own method. It simply counts the number of " and ' characters.
Improve this to your needs.
public List<string> SplitCsvLine(string s) {
int i;
int a = 0;
int count = 0;
List<string> str = new List<string>();
for (i = 0; i < s.Length; i++) {
switch (s[i]) {
case ',':
if ((count & 1) == 0) {
str.Add(s.Substring(a, i - a));
a = i + 1;
}
break;
case '"':
case '\'': count++; break;
}
}
str.Add(s.Substring(a));
return str;
}
It's not an exact answer to your question, but why don't you use already written library to manipulate CSV file, good example would be LinqToCsv. CSV could be delimited with various punctuation signs. Moreover, there are gotchas, which are already addressed by library creators. Such as dealing with name row, dealing with different date formats and mapping rows to C# objects.
You can replace "," with ; then split by ;
var values= s.Replace("\",\"",";").Split(';');
If your CSV line is tightly packed, it's easiest to use the head and tail removal mentioned earlier and then a simple split on the joining string. (The overload that takes a string array works on all framework versions; the single-string Split overload only exists in .NET Core 2.0 and later.)
string[] tokens = input.Substring(1, input.Length - 2).Split(new[] { "\",\"" }, StringSplitOptions.None);
This will only work if ALL fields are double-quoted, even if they don't (officially) need to be. It will be faster than RegEx, but only under those conditions.
Really useful if your data looks like
"Name","1","12/03/2018","Add1,Add2,Add3","other stuff"
Five years old but there is always somebody new who wants to split a CSV.
If your data is simple and predictable (i.e. never has any special characters like commas, quotes and newlines) then you can do it with split() or regex.
But to support all the nuances of the CSV format properly without code soup you should really use a library where all the magic has already been figured out. Don't re-invent the wheel (unless you are doing it for fun of course).
CsvHelper is simple enough to use:
https://joshclose.github.io/CsvHelper/2.x/
using (var parser = new CsvParser(textReader))
{
while(true)
{
string[] line = parser.Read();
if (line != null)
{
// do something
}
else
{
break;
}
}
}
More discussion / same question:
Dealing with commas in a CSV file

Need to pick up line terminators with StreamReader.ReadLine()

I wrote a C# program to read an Excel .xls/.xlsx file and output to CSV and Unicode text. I wrote a separate program to remove blank records. This is accomplished by reading each line with StreamReader.ReadLine(), and then going character by character through the string and not writing the line to output if it contains all commas (for the CSV) or all tabs (for the Unicode text).
The problem occurs when the Excel file contains embedded newlines (\x0A) inside the cells. I changed my XLS to CSV converter to find these new lines (since it goes cell by cell) and write them as \x0A, and normal lines just use StreamWriter.WriteLine().
The problem occurs in the separate program to remove blank records. When I read in with StreamReader.ReadLine(), by definition it only returns the string with the line, not the terminator. Since the embedded newlines show up as two separate lines, I can't tell which is a full record and which is an embedded newline for when I write them to the final file.
I'm not even sure I can read in the \x0A because everything on the input registers as '\n'. I could go character by character, but this destroys my logic to remove blank lines.
I would recommend that you change your architecture to work more like a parser in a compiler.
You want to create a lexer that returns a sequence of tokens, and then a parser that reads the sequence of tokens and does stuff with them.
In your case the tokens would be:
Column data
Comma
End of Line
You would treat '\n' ('\x0a') by itself as an embedded new line, and therefore include it as part of a column data token. A '\r\n' would constitute an End of Line token.
This has the advantages of:
Doing only 1 pass over the data
Only storing a max of 1 lines worth of data
Reusing as much memory as possible (for the string builder and the list)
It's easy to change should your requirements change
Here's a sample of what the Lexer would look like:
Disclaimer: I haven't even compiled, let alone tested, this code, so you'll need to clean it up and make sure it works.
enum TokenType
{
ColumnData,
Comma,
LineTerminator
}
class Token
{
public TokenType Type { get; private set;}
public string Data { get; private set;}
public Token(TokenType type)
{
Type = type;
}
public Token(TokenType type, string data)
{
Type = type;
Data = data;
}
}
private IEnumerable<Token> GetTokens(TextReader s)
{
var builder = new StringBuilder();
while (s.Peek() >= 0)
{
var c = (char)s.Read();
switch (c)
{
case ',':
{
if (builder.Length > 0)
{
yield return new Token(TokenType.ColumnData, ExtractText(builder));
}
yield return new Token(TokenType.Comma);
break;
}
case '\r':
{
var next = s.Peek();
if (next == '\n')
{
s.Read();
}
if (builder.Length > 0)
{
yield return new Token(TokenType.ColumnData, ExtractText(builder));
}
yield return new Token(TokenType.LineTerminator);
break;
}
default:
builder.Append(c);
break;
}
}
s.Read();
if (builder.Length > 0)
{
yield return new Token(TokenType.ColumnData, ExtractText(builder));
}
}
private string ExtractText(StringBuilder b)
{
var ret = b.ToString();
b.Remove(0, b.Length);
return ret;
}
Your "parser" code would then look like this:
public void ConvertXLS(TextReader s)
{
var columnData = new List<string>();
bool lastWasColumnData = false;
bool seenAnyData = false;
foreach (var token in GetTokens(s))
{
switch (token.Type)
{
case TokenType.ColumnData:
{
seenAnyData = true;
if (lastWasColumnData)
{
//TODO: do some error reporting
}
else
{
lastWasColumnData = true;
columnData.Add(token.Data);
}
break;
}
case TokenType.Comma:
{
if (!lastWasColumnData)
{
columnData.Add(null);
}
lastWasColumnData = false;
break;
}
case TokenType.LineTerminator:
{
if (seenAnyData)
{
OutputLine(columnData);
}
seenAnyData = false;
lastWasColumnData = false;
columnData.Clear();
break;
}
}
}
if (seenAnyData)
{
OutputLine(columnData);
}
}
You can't change StreamReader to return the line terminators, and you can't change what it uses for line termination.
I'm not entirely clear about the problem in terms of what escaping you're doing, particularly in terms of "and write them as \x0A". A sample of the file would probably help.
It sounds like you may need to work character by character, or possibly load the whole file first and do a global replace, e.g.
x.Replace("\r\n", "\u0000") // Or some other unused character
.Replace("\n", "\\x0A") // Or whatever escaping you need
.Replace("\u0000", "\r\n") // Replace the real line breaks
I'm sure you could do that with a regex and it would probably be more efficient, but I find the long way easier to understand :) It's a bit of a hack having to do a global replace though - hopefully with more information we'll come up with a better solution.
Essentially, a hard-return in Excel (shift+enter or alt+enter, I can't remember) puts a newline that is equivalent to \x0A in the default encoding I use to write my CSV. When I write to CSV, I use StreamWriter.WriteLine(), which outputs the line plus a newline (which I believe is \r\n).
The CSV is fine and comes out exactly how Excel would save it, the problem is when I read it into the blank record remover, I'm using ReadLine() which will treat a record with an embedded newline as a CRLF.
Here's an example of the file after I convert to CSV...
Reference,Name of Individual or Entity,Type,Name Type,Date of Birth,Place of Birth,Citizenship,Address,Additional Information,Listing Information,Control Date,Committees
1050,"Aziz Salih al-Numan
",Individual,Primary Name,1941 or 1945,An Nasiriyah,Iraqi,,Ba’th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)
1050a,???? ???? ???????,Individual,Original script,1941 or 1945,An Nasiriyah,Iraqi,,Ba’th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)
As you can see, the first record has an embedded new-line after al-Numan. When I use ReadLine(), I get '1050,"Aziz Salih al-Numan' and when I write that out, WriteLine() ends that line with a CRLF. I lose the original line terminator. When I use ReadLine() again, I get the line starting with '1050a'.
I could read the entire file in and replace them, but then I'd have to replace them back afterwards. Basically what I want to do is get the line terminator to determine if its \x0a or a CRLF, and then if its \x0A, I'll use Write() and insert that terminator.
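Since ReadLine() hides which terminator it consumed, one option is to read character by character and split records only on CRLF, keeping a lone \x0A inside the record. A minimal sketch, assuming real record ends are always CRLF as described above; RecordReader and ReadRecords are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class RecordReader
{
    // Yields one record per CRLF-terminated line. A lone '\n' is treated as
    // an embedded newline and stays inside the record.
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        var record = new StringBuilder();
        int next;
        while ((next = reader.Read()) >= 0)
        {
            char c = (char)next;
            if (c == '\r' && reader.Peek() == '\n')
            {
                reader.Read();                   // consume the '\n' of the CRLF pair
                yield return record.ToString();
                record.Clear();
            }
            else
            {
                record.Append(c);                // includes any lone '\n'
            }
        }
        if (record.Length > 0)
        {
            yield return record.ToString();      // final record without a trailing CRLF
        }
    }
}
```

Each returned string is a complete record, so the existing all-commas or all-tabs check can run on it unchanged, and writing it back with WriteLine() keeps the embedded \x0A intact.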
I know I'm a little late to the game here, but I was having the same problem and my solution was a lot simpler than most given.
If you are able to determine the column count which should be easy to do since the first line is usually the column titles, you can check your column count against the expected column count. If the column count doesn't equal the expected column count, you simply concatenate the current line with the previous unmatched lines. For example:
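string sep = "\",\"";
int columnCount = 0;
int lineCount = 0;
string currentLine;
string lastLine = null;
string[] lineData;
// sr is assumed to be an open StreamReader over the CSV file.
while ((currentLine = sr.ReadLine()) != null)
{
if (lineCount == 0)
{
// The first line holds the column titles, so it defines the expected column count.
lineData = currentLine.Split(new string[] { sep }, StringSplitOptions.None);
columnCount = lineData.Length;
++lineCount;
continue;
}
// Prepend any incomplete record carried over from previous lines.
string thisLine = lastLine + currentLine;
lineData = thisLine.Split(new string[] { sep }, StringSplitOptions.None);
if (lineData.Length < columnCount)
{
// Too few columns: the record continues on the next line.
// ReadLine() dropped the embedded newline; re-add it here if it must be preserved.
lastLine += currentLine;
continue;
}
else
{
lastLine = null;
}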
......
Thank you so much with your code and some others I came up with the following solution! I have added a link at the bottom to some code I wrote that used some of the logic from this page. I figured I'd give honor where honor was due! Thanks!
Below is a explanation about what I needed:
Try This, I wrote this because I have some very large '|' delimited files that have \r\n inside of some of the columns and I needed to use \r\n as the end of the line delimiter. I was trying to import some files using SSIS packages but because of some corrupted data in the files I was unable to. The File was over 5 GB so it was too large to open and manually fix. I found the answer through looking through lots of Forums to understand how streams work and ended up coming up with a solution that reads each character in a file and spits out the line based on the definitions I added into it. this is for use in a Command Line Application, complete with help :). I hope this helps some other people out, I haven't found a solution quite like it anywhere else, although the ideas were inspired by this forum and others.
https://stackoverflow.com/a/12640862/1582188
