I am currently splitting a CSV file, as it is read into an application, on the comma; however, there are legitimate commas held inside double quotes that are getting split when I don't want them to be.
When using TextFieldParser it reads the fields I want it to read, but it reads all the fields at once, and I am struggling to get them back out on the correct lines.
public string ParseCSVForFields(string dataFileName)
{
var sb = new StringBuilder();
var line = new List<string>();
using (TextFieldParser parser = new TextFieldParser(dataFileName))
{
parser.TextFieldType = FieldType.Delimited;
parser.SetDelimiters(",");
parser.HasFieldsEnclosedInQuotes = true;
while (!parser.EndOfData)
{
//Processing row
string currentRow = parser.ReadLine();
string[] fields = parser.ReadFields();
foreach (var field in fields)
{
// this is where i am stuck
}
}
}
return null;
}
Any and all help would be very much appreciated.
Thanks.
You are calling both ReadLine and ReadFields. That seems suspicious: ReadLine consumes a row, so the ReadFields that follows parses the next row, and you silently skip every other line. Remove the ReadLine part.
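For illustration, here is roughly what the method looks like with ReadLine removed; returning the fields joined with a pipe is just a placeholder, since the original method returns null:
public string ParseCSVForFields(string dataFileName)
{
    var sb = new StringBuilder();
    using (TextFieldParser parser = new TextFieldParser(dataFileName))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.HasFieldsEnclosedInQuotes = true;
        while (!parser.EndOfData)
        {
            // ReadFields consumes one row and respects quoted commas
            string[] fields = parser.ReadFields();
            sb.AppendLine(string.Join("|", fields));
        }
    }
    return sb.ToString();
}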
I need to be able to rename the column in a spreadsheet from 'idn_prod' to 'idn_prod1', but there are two columns with this name.
I have tried implementing code from similar posts, but I've only been able to update both columns. Below you'll find the code I have that just renames both columns.
//locate and edit column in csv
string file1 = @"C:\Users\username\Documents\AppDevProjects\import.csv";
string[] lines = System.IO.File.ReadAllLines(file1);
System.IO.StreamWriter sw = new System.IO.StreamWriter(file1);
foreach(string s in lines)
{
sw.WriteLine(s.Replace("idn_prod", "idn_prod1"));
}
I expect only the 2nd column to be renamed, but the actual output is that both are renamed.
Here are the first couple rows of the CSV:
I'm assuming that you only need to update the column header; the actual rows need not be updated.
var file1 = @"test.csv";
var lines = System.IO.File.ReadAllLines(file1);
var columnHeaders = lines[0];
var textToReplace = "idn_prod";
var newText = "idn_prod1";
var indexToReplace = columnHeaders
    .LastIndexOf("idn_prod"); // LastIndexOf ensures that you pick the second idn_prod
columnHeaders = columnHeaders
    .Remove(indexToReplace, textToReplace.Length)
    .Insert(indexToReplace, newText); // remove the second idn_prod and insert the updated value
using (System.IO.StreamWriter sw = new System.IO.StreamWriter(file1))
{
sw.WriteLine(columnHeaders);
foreach (var str in lines.Skip(1))
{
sw.WriteLine(str);
}
sw.Flush();
}
Replace the foreach (string s in lines) loop with a for loop, track the line index, and apply the rename only to the header line (index 0), as sketched below.
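A minimal sketch of that idea, reusing the LastIndexOf trick from the answer above (the file1 path is assumed from the question):
string[] lines = System.IO.File.ReadAllLines(file1);
using (var sw = new System.IO.StreamWriter(file1))
{
    for (int i = 0; i < lines.Length; i++)
    {
        string line = lines[i];
        if (i == 0) // only the header row gets the rename
        {
            int idx = line.LastIndexOf("idn_prod");
            line = line.Remove(idx, "idn_prod".Length).Insert(idx, "idn_prod1");
        }
        sw.WriteLine(line);
    }
}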
I believe the only way to handle this properly is to crack the header line (first string that has column names) into individual parts, separated by commas or tabs or whatever, and run through the columns one at a time yourself.
Your loop would consider the first line from the file, use the Split function on the delimiter, and look for the column you're interested in:
bool headerSeen = false;
foreach (string s in lines)
{
if (!headerSeen)
{
// special: this is the header
string[] parts = s.Split('\t');
for (int i = 0; i < parts.Length; i++)
{
if (parts[i] == "idn_prod")
{
// only fix the *first* one seen
parts[i] = "idn_prod1";
break;
}
}
sw.WriteLine( string.Join("\t", parts));
headerSeen = true;
}
else
{
sw.WriteLine( s );
}
}
The only reason this is even remotely possible is that it's the header and not the individual lines; headers tend to be more predictable in format, and you worry less about quoting and fields that contain the delimiter, etc.
Trying this on the individual data lines will rarely work reliably: if your delimiter is a comma, what happens if an individual field contains a comma? Then you have to worry about quoting, and this enters all kinds of fun.
For doing any real CSV work in C#, it's really worth looking into a package that specializes in this, and I've been thrilled with CsvHelper from Josh Close. Highly recommended.
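For reference, reading typed records with CsvHelper looks roughly like this (the row class is made up for the example, and the exact CsvReader constructor varies slightly between CsvHelper versions):
using System.Globalization;
using CsvHelper;

public class ProductRow // hypothetical shape of one row
{
    public string idn_prod { get; set; }
    public string idn_prod1 { get; set; }
}

// ...
using (var reader = new StreamReader("import.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    foreach (var record in csv.GetRecords<ProductRow>())
    {
        // quoted fields and embedded commas are handled for you
    }
}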
I'm trying to get certain line endings when using StreamReader in a C# app.
Code:
public static IEnumerable<string> ReadAllLines(string path)
{
if (!File.Exists(path)) return null;
List<string> lines = new List<string>();
using (var reader = new StreamReader(path))
{
while (!reader.EndOfStream)
{
lines.Add(reader.ReadLine(@"(\r\n|\n)"));
}
}
return lines.ToArray();
}
You can see where I have reader.ReadLine(@"(\r\n|\n)");. If I write reader.ReadLine(); I have no issues, but when I try to add line endings to it, like I found online, it tells me there is no such overload of ReadLine.
Question: Can someone assist me with figuring out how to add certain line endings so I can successfully scan my CSV files?
Update:
So I found a way to handle the line endings I was looking for and attempted it three different ways, but I'm still getting \r splits on some lines. It doesn't make a lot of sense. Can anyone see any issues with the lines of code below?
var reader = new StreamReader(path, Encoding.Default);
//string text = reader.ReadToEnd();
////// attempt 1 - this gives the best result but is still splitting on a \r in one of the fields
//// List<string> lines = new List<string>(text.Split(new[] {"\r","\n"}, StringSplitOptions.None));
////// attempt 2 - this worked almost identically to the option above but seemed faster.
//var lines = Regex.Split(text, "\r\n");
//// attempt 3 - this split on both \r and \n separately
// List<string> lines = new List<string>(text.Split("\r\n".ToCharArray()));
Any other suggestions on how to do this would be great!
Based on your comment to your question:
so just to explain what is going on: I have a CSV file. When you put it in Excel, some lines go out to ZZ and other lines only go to AZ (not as long). The white space at the end of AZ all the way to ZZ gets added to the next line and screws everything up. I assumed it was because the line endings were not correct, but they are, as you state above
Try a String.TrimEnd() method call before adding the string to your list.
public static IEnumerable<string> ReadAllLines(string path)
{
if (!File.Exists(path)) return null;
var lines = new List<string>();
using (var reader = new StreamReader(path))
{
while (!reader.EndOfStream)
{
// add the TrimEnd call here
lines.Add(reader.ReadLine().TrimEnd());
}
}
return lines.ToArray();
}
I've been working with some big delimited text files (~1GB) these days. They look something like this:
COlumn1 #COlumn2#COlumn3#COlumn4
COlumn1#COlumn2#COlumn3 #COlumn4
where # is the delimiter.
In case a column is invalid, I might have to remove it from the whole text file. The output file when column 3 is invalid should look like this:
COlumn1 #COlumn2#COlumn4
COlumn1#COlumn2#COlumn4
string line = "COlumn1# COlumn2 #COlumn3# COlumn4";
int junk =3;
int columncount = line.Split(new char[] { '#' }, StringSplitOptions.None).Count();
//remove the [junk-1]th '#' and the value till [junk]th '#'
//"COlumn1# COlumn2 # COlumn4"
I wasn't able to find a C# version of this on SO. Is there a way I can do that? Please help.
EDIT:
The solution I found myself is below, and it does the job. Is there a way I could improve it to reduce the performance impact it might have on large text files?
int junk = 3;
string line = "COlumn1#COlumn2#COlumn3#COlumn4";
int counter = 0;
int colcount = line.Split(new char[] { '#' }, StringSplitOptions.None).Length - 1;
string[] linearray = line.Split(new char[] { '#' }, StringSplitOptions.None);
List<string> linelist = linearray.ToList();
linelist.RemoveAt(junk - 1);
string finalline = string.Empty;
foreach (string s in linelist)
{
counter++;
finalline += s;
if (counter < colcount)
finalline += "#";
}
Console.WriteLine(finalline);
EDITED
This method can be very memory-expensive, as you can read in this post; the suggestion there is:
If you need to run complex queries against the data in the file, the right thing to do is to load the data to database and let DBMS to take care of data retrieval and memory management.
To avoid that memory consumption you should use a StreamReader to read the file line by line.
This could be a start for your task; the invalid-match logic is left as a stub:
using System;
using System.Collections.Generic;
using System.IO;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
const string fileName = "temp.txt";
var results = FindInvalidColumns(fileName);
using (var reader = File.OpenText(fileName))
using (var writer = new StreamWriter("new.txt"))
{
    while (!reader.EndOfStream)
    {
        var line = reader.ReadLine();
        if (line == null) continue;
        var split = line.Split(new[] { "#" }, StringSplitOptions.None);
        var kept = new List<string>();
        for (var i = 0; i < split.Length; i++)
            if (!results.Contains(i))
                kept.Add(split[i]);
        // re-join with the delimiter so the remaining columns stay separated
        writer.WriteLine(string.Join("#", kept));
    }
}
}
private static List<int> FindInvalidColumns(string fileName)
{
var invalidColumnIndexes = new List<int>();
using (var reader = File.OpenText(fileName))
{
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
if (line == null) continue;
var split = line.Split(new[] { "#" }, StringSplitOptions.None);
for (var i = 0; i < split.Length; i++)
{
if (IsInvalid(split[i]) && !invalidColumnIndexes.Contains(i))
invalidColumnIndexes.Add(i);
}
}
}
return invalidColumnIndexes;
}
private static bool IsInvalid(string s)
{
    // stub: plug your invalid-column detection logic in here
    return false;
}
}
}
First, what you will do is re-write the line to a text file using a 0-length string for COlumn3. Therefore the line after being written correctly would look like this:
COlumn1#COlumn2##COlumn4
As you can see, there are two delimiters between COlumn2 and COlumn4. This is a cell with no data in it. (By "cell" I mean one column of a certain, single row.) Later, when some other process reads this using the Split function, it will still create a new value for Column 3, but in the array generated by Split, the 3rd position would be an empty string:
String[] columns = stream_reader.ReadLine().Split('#');
int lengthOfThirdItem = columns[2].Length; // for proof
// lengthOfThirdItem = 0
This reduces invalid values to empty strings and persists them back to the text file.
For more on String.Split see C# StreamReader save to Array with separator.
It is not possible to write to lines internal to a text file while it is also open for read. This article discusses it some (simultaneous read-write a file in C#), but it looks like that question-asker just wants to be able to write lines to the end. You want to be able to write lines at any point in the interior. I think this is not possible without buffering the data in some way.
The simplest way to buffer the data is to rename the file to a temp file first, using File.Move() (http://msdn.microsoft.com/en-us/library/system.io.file.move(v=vs.110).aspx). Then use the temp file as the data source: open it to read in the data, which may have corrupt entries, and write the data afresh to the original file name, using the approach I describe above to represent empty columns. After this is complete, you can delete the temp file.
Important
Deleting the temp file may leave you vulnerable to power and data transients (or software 'transients'). (I.e., a power drop that interrupts part of the process could leave the data in an unusable state.) So you may also want to leave the temp file on the drive as an emergency backup in case of some problem.
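Putting the temp-file approach together, a rough sketch (the path, the # delimiter, and the hard-coded invalid column index are just for illustration):
string path = "data.txt";
string temp = path + ".bak";
int invalidColumn = 2; // hypothetical: the third column is the bad one

File.Move(path, temp); // buffer the original as a temp file
using (var reader = new StreamReader(temp))
using (var writer = new StreamWriter(path))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string[] cols = line.Split('#');
        cols[invalidColumn] = ""; // blank the cell but keep its delimiters
        writer.WriteLine(string.Join("#", cols));
    }
}
// per the warning above, consider keeping the temp file as a backup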
I am working on ERP integration software. I need to parse a CSV file from an HRM application to make an entry.
I am getting the input CSV file like this:
$Emp.No$=123456,$CardNo$=254658,$InTime$="12/11/2013 09:03:05",$OutTime$="12/11/2013 17:25:20"
$Emp.No$=565556,$CardNo$=254689,$InTime$="12/11/2013 09:03:50",$OutTime$="12/11/2013 18:01:11"
The CSV file doesn't have a column name header, instead each field has a field name associated with it inside $FieldName$.
I tried to parse it with CSVHelper. It works fine when using the ReadFieldsByIndex() method.
Problem:
Some lines do not have an $InTime$ or $OutTime$ field, so reading by index fails. How can I read only the data that is available, and map it according to the field names present in each line?
You haven't got a CSV file there. You have a data file, each line of which contains one or more key/value pairs, separated by commas. The key and value are separated by an = and the key is enclosed by $'s.
Having expressed what you have, that should help you identify a solution:
Don't use a CSV framework.
Read each line at a time from the file.
Split the line on , to give you the key value pairs.
Split the key value pairs on = to give the two parts.
(Optionally) remove the $ from the key name.
You then should have a suitable level of data to transfer these values into whatever destination objects you have.
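A minimal sketch of those steps, collecting each line into a dictionary keyed by field name (path is hypothetical, and this assumes, as in the sample, that the quoted values never contain commas):
foreach (string line in File.ReadLines(path))
{
    var record = new Dictionary<string, string>();
    foreach (string pair in line.Split(','))
    {
        string[] kv = pair.Split(new[] { '=' }, 2); // split on the first '=' only
        string key = kv[0].Trim('$');               // strip the $ wrapper from the key
        record[key] = kv[1].Trim('"');              // drop surrounding quotes from the value
    }
    // record["Emp.No"], record["CardNo"], and, when present, record["InTime"]...
}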
This will write to a separate file, with a header row followed by the values.
string file = @"D:\STACKOVERFLOW\csvproblem.txt";
string newfile = @"D:\STACKOVERFLOW\output.txt";
StreamReader sr = new StreamReader(file);
StreamWriter sw = new StreamWriter(newfile);
try
{
StringBuilder sb = new StringBuilder();
StringBuilder sb_header = new StringBuilder();
bool recordHeader = true;
while (!sr.EndOfStream)
{
string readLine = sr.ReadLine();
string[] split = readLine.Split(',');
sb = new StringBuilder();
foreach (string str in split)
{
if (recordHeader)
{
if (str.IndexOf('$') < str.LastIndexOf('$'))
{
// extract the $FieldName$ token, including both $ delimiters
sb_header.AppendFormat("{0},",
    str.Substring(str.IndexOf('$'), str.LastIndexOf('$') - str.IndexOf('$') + 1));
}
}
sb.AppendFormat("{0},", str.Substring(str.IndexOf('=')+1));
}
if (recordHeader)
{
sw.WriteLine(sb_header.ToString().Trim(','));
}
sw.WriteLine(sb.ToString().Trim(','));
recordHeader = false;
}
}
finally
{
sr.Close();
sw.Close();
}
I am using Microsoft.VisualBasic.FileIO.TextFieldParser to read a csv file, edit it , then parse it.
The problem is the quotes are not being kept after parsing.
I tried using parser.HasFieldsEnclosedInQuotes = true; but it does not seem to keep the quotes for some reason.
This breaks when a quoted field contains a comma, for example:
Before
"some, field"
After
some, field
as two separate fields.
Here is my method
public static void CleanStaffFile()
{
String path = @"C:\file.csv";
String dpath = String.Format(@"C:\file_{0}.csv", DateTime.Now.ToString("MMddyyHHmmss"));
List<String> lines = new List<String>();
if (File.Exists(path))
{
using (TextFieldParser parser = new TextFieldParser(path))
{
parser.HasFieldsEnclosedInQuotes = true;
parser.Delimiters = new string[] { "," };
while (!parser.EndOfData)
{
string[] parts = parser.ReadFields();
if (parts == null)
{
break;
}
if ((parts[12] != "") && (parts[12] != "*,116"))
{
parts[12] = parts[12].Substring(0, 3);
}
else
{
parts[12] = "0";
}
lines.Add(string.Join(",", parts));
}
}
using (StreamWriter writer = new StreamWriter(dpath, false))
{
foreach (String line in lines)
writer.WriteLine(line);
}
}
MessageBox.Show("CSV file successfully processed :\n");
}
So you want to have quotes after you have modified it at string.Join(",", parts)? Then it's easy since only fields which contain the separator were wrapped in quotes before. Just add them again before the String.Join.
So before (and desired):
"some, field"
After (not desired):
some, field
This should work:
string[] fields = parser.ReadFields();
// insert your logic here ....
var newFields = fields
.Select(f => f.Contains(",") ? string.Format("\"{0}\"", f) : f);
lines.Add(string.Join(",", newFields));
Edit
I would like to keep quotes even if doesn't contain a comma
Then it's even easier:
var newFields = fields.Select(f => string.Format("\"{0}\"", f));
The TextFieldParser.HasFieldsEnclosedInQuotes property is used as follows, from the MSDN page:
If the property is True, the parser assumes that fields are enclosed in quotation marks (" ") and may contain line endings.
If a field is enclosed in quotation marks, for example, abc, "field2a,field2b", field3 and this property is True, then all text enclosed in quotation marks will be returned as is; this example would return abc|field2a,field2b|field3. Setting this property to False would make this example return abc|"field2a|field2b"|field3.
The quotes will indicate the start and end of a field, which may then contain the character(s) used to normally separate fields. If your data itself has quotes, you need to set HasFieldsEnclosedInQuotes to false.
If your data fields can contain both separators and quotes, you will need to start escaping the quotes before parsing, which is a problem. Basically, you're going beyond the capabilities of a simple CSV file.
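If you control the output format, the conventional CSV fix is to double embedded quotes and wrap the field in quotes when writing, which quote-aware parsers (including TextFieldParser with HasFieldsEnclosedInQuotes set) generally understand; a minimal helper sketch:
// Wrap a field in quotes when needed, doubling any embedded quotes
static string EscapeCsvField(string field)
{
    if (field.Contains(",") || field.Contains("\"") || field.Contains("\n"))
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    return field;
}

// e.g. EscapeCsvField("some, field") returns "\"some, field\""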