How to handle quotation marks within CSV files? - c#

To read a CSV file, I use the following statement:
var query = from line in rawLines
let data = line.Split(';')
select new
{
col01 = data[0],
col02 = data[1],
col03 = data[2]
};
The CSV file I want to read is malformed in that an entry can contain the separator ; itself as data when it is surrounded by quotation marks.
Example:
col01;col02;col03
data01;"data02;";data03
My read statement above does not work here, since it interprets the second row as four columns.
Question: Is there an easy way to handle this malformed CSV correctly? Perhaps with another LINQ query?

Just use a CSV parser and STOP ROLLING YOUR OWN:
using (var parser = new TextFieldParser("test.csv"))
{
parser.CommentTokens = new string[] { "#" };
parser.SetDelimiters(new string[] { ";" });
parser.HasFieldsEnclosedInQuotes = true;
// Skip over header line.
parser.ReadLine();
while (!parser.EndOfData)
{
string[] fields = parser.ReadFields();
Console.WriteLine("{0} {1} {2}", fields[0], fields[1], fields[2]);
}
}
TextFieldParser is built into .NET. Just add a reference to the Microsoft.VisualBasic assembly (the class lives in the Microsoft.VisualBasic.FileIO namespace) and you are good to go. A real CSV parser will happily handle this situation.

Parsing CSV files manually tends to lead to issues like this. I would advise using a third-party library such as CsvHelper to handle the parsing.
Furthermore, it's not a good idea to hard-code the comma as the separator, because the list separator can be overridden in your computer's regional settings.
Let me know if I can help further,
Matt

Not very elegant, but after using your method you can check whether any colxx contains an unfinished (single) quotation mark; if it does, join it with the next colxx.
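A minimal sketch of that rejoin idea, assuming a field is "unfinished" whenever it contains an odd number of quote characters. The class and method names (QuoteAwareSplit, SplitAndRejoin) are hypothetical, and the sketch neither strips the surrounding quotes nor handles quoted fields that span several lines:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QuoteAwareSplit
{
    // Splits on the separator, then glues pieces back together while the
    // accumulated field still contains an odd number of quote characters.
    public static List<string> SplitAndRejoin(string line, char separator = ';', char quote = '"')
    {
        var fields = new List<string>();
        string pending = null;

        foreach (var piece in line.Split(separator))
        {
            // Put back the separator that Split removed from inside the quoted field.
            pending = pending == null ? piece : pending + separator + piece;

            // An even quote count means every opened quote has been closed.
            if (pending.Count(c => c == quote) % 2 == 0)
            {
                fields.Add(pending);
                pending = null;
            }
        }

        // A dangling quote at the end of the line: keep what is left rather than drop it.
        if (pending != null)
        {
            fields.Add(pending);
        }

        return fields;
    }
}
```

For the sample row data01;"data02;";data03 this yields three fields, with the quotes still attached to the second one. A real parser remains the safer choice.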

Related

Can someone please explain why the foreach loop gives an "invalid token" error and why "splittedText" is reported as not existing in the current context?

string[] splittedText = File.ReadAllLines(#"file.txt");//.Split(',');
foreach (string data in splittedText)
{
}
I want to read a file in C# into an array of strings. Then I will iterate over the array to fetch the data I need.
If you want to read a CSV file, you should use a CSV parser. Values in a CSV file are separated by commas, and in some cases a value can itself contain a comma. In that case, the value is wrapped in double quotes, and the solution below will not handle that scenario.
var splittedText = File.ReadAllText("E:\\Test.txt").Split(',');
foreach (string data in splittedText)
{
Console.WriteLine(data.Trim());
}
Hint: whether to read the file line by line or read the whole content at once depends on your use case. The code snippet below may give some idea of how to split the content.
Please try.
var inputtext = File.ReadAllText(@"inpufile.txt");
inputtext.Replace("\n", "")
.Split(',',
StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
.ToList().ForEach(t =>
{
System.Console.WriteLine(t);
//Other manipulations
});
If you want to split on multiple characters, pass a character array to Split():
new char[] { ',', ':' };
Thank you.
You need to change File.ReadAllLines to File.ReadAllText(path); then you can call the Split method on the result.

How to avoid false separators in csv / XML

I've been trying to understand how XML and CSV parsing work, without actually writing any code yet. I might have to parse a .csv file in the ongoing project and I'd like to be ready. (I'll have to convert them to .ofx files)
I'm also aware there are probably a thousand XML and CSV parsers out there, so I'm more curious than worried. I intend to use the XmlReader that I believe Microsoft provides.
Let's say I have the following .csv file
02/02/2016 ; myfirstname ; mylastname ; somefield ; 321654 ; commentary ; blabla
Sometimes a field will be missing. Which means, for the sake of the example, that the lastname isn't mandatory, and somefield could be right after the first name.
My questions are :
How do I avoid the confusion between somefield and lastname?
I could count the total number of fields, but in my situation two are optional, if there is only one missing, I can't be sure which one it is.
How do I avoid false "tags"? I mean, if the user first comment includes a ;, how can I be sure it's a part of his comment and not the start of the following tag?
Again, I could count the remaining fields and find out where I am, but that excludes the optional fields problem.
My questions also apply to XML: what can I do if the user starts writing XML in his form? Whether I decide to export the form as .csv or .xml, there can be trouble.
Right now I'm assuming that the C# XML reader/parser is good enough to deal with it; and if it is, I'm really curious about how.
Assuming the CSV/XML data has been exported properly, none of this will be a problem. Missing fields will be handled by repeated separators:
02/02/2016;myfirstname;;somefield
Semi-colons within a field will normally be handled by quoting:
02/02/2016;"myfirst;name";
Quotes are escaped within a string:
02/02/2016;"my""first""name";
With XML it's even less of an issue since the tags or attributes will all have names.
If your CSV data is NOT well-formed, then you have a much bigger problem, as it may be impossible to distinguish missing fields and non-quoted separators.
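To make the quoting rules above concrete, here is a minimal sketch of the writing side, assuming the semicolon separator and the doubled-quote escaping shown in the examples. The names CsvWriting, QuoteField and JoinRow are hypothetical helpers, not part of any library:

```csharp
using System;
using System.Linq;

public static class CsvWriting
{
    // Quotes a field only when it needs it, doubling any embedded quotes.
    public static string QuoteField(string value, char separator = ';')
    {
        bool needsQuotes = value.IndexOf(separator) >= 0
                           || value.IndexOf('"') >= 0
                           || value.IndexOf('\n') >= 0
                           || value.IndexOf('\r') >= 0;

        if (!needsQuotes)
        {
            return value;
        }

        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Joins fields into one row; a missing (null) field becomes an empty field,
    // which shows up in the output as two adjacent separators.
    public static string JoinRow(char separator, params string[] fields)
    {
        return string.Join(separator.ToString(),
            fields.Select(f => QuoteField(f ?? "", separator)));
    }
}
```

Calling JoinRow(';', "02/02/2016", "myfirst;name", null, "somefield") produces 02/02/2016;"myfirst;name";;somefield, so the missing last name and the embedded semicolon both stay unambiguous for a parser that follows the same rules.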
How do I avoid false "tags"? String values should be quoted if they can contain separator characters. If you create the CSV file yourself, quote all string values when writing and unquote them when reading.
How do I avoid the confusion between somefield and lastname? There is no general solution for this; every case must be handled individually. Can a general algorithm decide whether the first name or the last name is missing? No.
If you know which field(s) can be omitted, you can write "intelligent" handling for them.
Use XML and all of your problem will be solved.
First
How do I avoid the confusion between somefield and lastname?
There is no way to do this without changing the format of the file. For example, when "mylastname" is empty, write an empty quoted value ("") or an empty field, like this: ;;
How do I avoid false "tags"? I mean, if the user first comment includes a ;, how can I be sure it's a part of his comment and not the start of the following tag?
It is simple; the file has to look like this:
; - column separator
"" - delimiter around column values
value;value;"value;;;;value";value
To split only on separators ; that are not inside "", use the code below (it compiles and has been tested):
public static string[] SplitWithDelimeter(this string line, char separator, char checkSeparator, bool eraseCheckSeparator)
{
var separatorsIndexes = new List<int>();
var open = false;
for (var i = 0; i < line.Length; i++)
{
if (line[i] == checkSeparator)
{
open = !open;
}
if (!open && line[i] == separator )
{
separatorsIndexes.Add(i);
}
}
separatorsIndexes.Add(line.Length);
var result = new string[separatorsIndexes.Count];
var first = 0;
for (var j = 0; j < separatorsIndexes.Count; j++)
{
var tempLine = line.Substring(first, separatorsIndexes[j] - first);
result[j] = eraseCheckSeparator ? tempLine.Replace(checkSeparator, ' ').Trim() : tempLine;
first = separatorsIndexes[j] + 1;
}
return result;
}
With eraseCheckSeparator set to false, the result would be:
value
value
"value;;;;value"
value

CSV (or excel) parsing ; eliminate empty column

I am using the TextFieldParser class to parse the file. I want to eliminate or ignore a complete column if the "entire column" is empty (which means a single empty cell in a particular row should be considered). Is this possible?
Note: as per functionality, I need to use data copied to clipboard. So can not pass direct file path to the parser.
TextFieldParser parser = new TextFieldParser(new StringReader(row));
string[] delimiters = { ",", "\t" };
parser.SetDelimiters(delimiters);
string[] columns = null;
while (!parser.EndOfData)
{
columns = parser.ReadFields();
}
Appreciate your help.
After reading through the TextFieldParser class page on MSDN, I see nothing there that suggests this class can ignore a whole column. That is something you would have to do manually. Furthermore, your code does not seem right: every iteration assigns the result of ReadFields() to the same variable, so only the last row is still available after the loop:
while (!parser.EndOfData)
{
columns = parser.ReadFields();
}
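Doing it manually could look roughly like the sketch below: collect every row first, then keep only the columns that have a non-blank cell in at least one row. The names EmptyColumnFilter, ReadWithoutEmptyColumns and DropEmptyColumns are hypothetical, and treating whitespace-only cells as empty is an assumption:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.VisualBasic.FileIO;

public static class EmptyColumnFilter
{
    // Parses clipboard-style text (no file path needed) and removes empty columns.
    public static List<string[]> ReadWithoutEmptyColumns(string clipboardText)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(new StringReader(clipboardText)))
        {
            parser.SetDelimiters(",", "\t");
            while (!parser.EndOfData)
            {
                // Add each row to the list instead of overwriting one variable.
                rows.Add(parser.ReadFields());
            }
        }
        return DropEmptyColumns(rows);
    }

    // Keeps a column only if at least one row has a non-blank value in it.
    public static List<string[]> DropEmptyColumns(List<string[]> rows)
    {
        int width = rows.Count == 0 ? 0 : rows.Max(r => r.Length);

        var keep = Enumerable.Range(0, width)
            .Where(i => rows.Any(r => i < r.Length && !string.IsNullOrWhiteSpace(r[i])))
            .ToList();

        // Rows shorter than the widest row are padded with empty strings.
        return rows
            .Select(r => keep.Select(i => i < r.Length ? r[i] : "").ToArray())
            .ToList();
    }
}
```

A column with a single empty cell survives as long as some other row has a value in that position; only columns that are blank in every row are removed.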

Processing CSV file using C#

I am creating a CSV importing tool (comma-separated). I am trying to make this tool as generic as possible, so that it can process any CSV file.
I have almost finalised the tool, but came across one file that I am finding difficult to process.
How can I process the file with data in following format?
column1,column2,column3,column4,column5
----------
alex,p,22323,23232,hello
mike,t,"121212,232323,4343434",33432,hi
guna,s,"2423,2332",whats
cena,a,34443,33432,up
Since the file is comma-separated, and one of its values is itself comma-separated inside quotes ("value,value,value"), I am finding it difficult to process.
How can I tackle this issue?
I do not have control over the CSV file, so I can't change the format.
As per @dtb... use a CSV parser. If you reference Microsoft.VisualBasic then you can:
var data=@"column1,column2,column3,column4,column5
----------
alex,p,22323,23232,hello
mike,t,""121212,232323,4343434"",33432,hi
guna,s,""2423,2332"",whats
cena,a,34443,33432,up";
using (var sr = new StringReader(data))
using (var parser =
new TextFieldParser(sr)
{
TextFieldType = FieldType.Delimited,
Delimiters = new[] { "," },
CommentTokens = new[] { "--" }
})
{
while (!parser.EndOfData)
{
string[] fields;
fields = parser.ReadFields();
//yummy
}
}
This deals with quotes correctly.

How to replace > 1 tokens on a string?

I have a list of tokens
[#date]
[#time]
[#fileName]
... etc
These are dispersed all over a large file. I can parse the file and replace them with Regex.Replace easily when there's only one token on a line. However, the problem arises when there are two tokens on one line
example:
[#date] [#time]
What I thought about doing is using String.Split with " " as the delimiter, and then iterate through the result checking if there are tokens.
But I see two problems with this approach, the file is rather large and this would definitely impact performance. The second problem is that the file that will be outputted is a SQL file and I'd like to retain the white space just for looks.
Is there a more elegant solution to this problem? Or is it just another case of premature optimization?
One thing you can do is that instead of replacing patterns line by line, replace them in the whole file:
string fileContent = File.ReadAllText(path);
fileContent = Regex.Replace(fileContent, pattern1, replacement1);
...
fileContent = Regex.Replace(fileContent, patternN, replacementN);
One simple way to do this is to store the tokens and their values separately, and then iterate over them, replacing each token in your query with its value. An example is given below:
var tokensWithValues = new Dictionary<string, object>()
{
{"[#date]", DateTime.Now},
{"[#time]", DateTime.Now.Ticks},
{"[#fileName]", "myFile.xml"},
};
var sqlQuery = File.ReadAllText("mysql.sql");
foreach (var tokenValue in tokensWithValues)
{
sqlQuery = sqlQuery.Replace(tokenValue.Key, tokenValue.Value.ToString());
}
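If scanning the text once per token becomes a concern on a large file, a single Regex pass with a lookup is one alternative. This is a sketch under assumptions: the pattern treats every token as [#name] with word characters only, and TokenReplacer and ReplaceTokens are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenReplacer
{
    // Assumed token shape: an opening bracket, '#', word characters, a closing bracket.
    private static readonly Regex TokenPattern = new Regex(@"\[#\w+\]");

    // Replaces every known token in one pass; unknown tokens and all
    // surrounding whitespace are left exactly as they were.
    public static string ReplaceTokens(string text, IDictionary<string, string> values)
    {
        return TokenPattern.Replace(text, match =>
            values.TryGetValue(match.Value, out var replacement)
                ? replacement
                : match.Value);
    }
}
```

Because only the matched tokens are rewritten, a line containing several tokens is handled the same way as a line containing one, and the original spacing of the SQL file is preserved.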
Try using the string.Replace(...) method. It replaces all instances of a string. Strings are immutable, so assign the result back to the variable.
For example:
string file = File.ReadAllText("myfile.txt");
file = file.Replace("[#date]", "replaced_value");
The above replaces all instances of "[#date]" with "replaced_value".
Edited as previous answer included custom extensions not available to OP. Thanks llya.
