Converting Excel to CSV with C#: getting an extra comma on each row

I am converting an Excel file to CSV in an Azure WebJob using C#, keeping the file in blob storage during the process, but I am getting an extra comma at the end of each row in my CSV file.
Example:
1,Test,Doe,
2,Test,John,
Here is my code for producing the csv:
public static class ExcelToCSVConvertor
{
public static List<BlobInput> Convert(List<BlobOutput> inputs)
{
var dataForBlobInput = new List<BlobInput>();
try
{
foreach (BlobOutput item in inputs)
{
using (SpreadsheetDocument document = SpreadsheetDocument.Open(item.BlobContent, false))
{
foreach (Sheet _Sheet in document.WorkbookPart.Workbook.Descendants<Sheet>())
{
WorksheetPart _WorksheetPart = (WorksheetPart)document.WorkbookPart.GetPartById(_Sheet.Id);
Worksheet _Worksheet = _WorksheetPart.Worksheet;
SharedStringTablePart _SharedStringTablePart = document.WorkbookPart.GetPartsOfType<SharedStringTablePart>().First();
SharedStringItem[] _SharedStringItem = _SharedStringTablePart.SharedStringTable.Elements<SharedStringItem>().ToArray();
StringBuilder stringBuilder = new StringBuilder();
foreach (var row in _Worksheet.Descendants<Row>())
{
foreach (Cell _Cell in row)
{
string Value = string.Empty;
if (_Cell.CellValue != null)
{
if (_Cell.DataType != null && _Cell.DataType.Value == CellValues.SharedString)
Value = _SharedStringItem[int.Parse(_Cell.CellValue.Text)].InnerText;
else
Value = _Cell.CellValue.Text;
}
stringBuilder.Append(string.Format("{0},", Value.Trim()));
}
stringBuilder.Append("\n");
}
byte[] data = Encoding.UTF8.GetBytes(stringBuilder.ToString().Trim());
string fileNameWithoutExtn = item.BlobName.ToString().Substring(0, item.BlobName.ToString().IndexOf("."));
string newFilename = $"{fileNameWithoutExtn}_{_Sheet.Name}.csv";
dataForBlobInput.Add(new BlobInput { BlobName = newFilename, BlobContent = data });
}
}
}
}
catch (Exception Ex)
{
throw Ex;
}
return dataForBlobInput;
}
}

This line appends a comma after every single value, including the last one in each row:
stringBuilder.Append(string.Format("{0},", Value.Trim()));
So each value is written as:
1,
Test,
Doe,
2,
Test,
John,
You need to leave out the comma after the last value in a row (the last iteration).
On the last iteration of foreach (Cell _Cell in row), the call should be:
stringBuilder.Append(string.Format("{0}", Value.Trim()));
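One way to avoid tracking the last iteration at all is to collect the values of a row first and let string.Join place the separators, since it only writes a comma between values. The following is a minimal sketch of that idea, not the asker's actual code: the class and method names (CsvRowSketch, ToCsvLine, ToCsv) are hypothetical, and it assumes the cell values have already been extracted as strings and contain no commas or quotes that would need escaping.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper, for illustration only.
public static class CsvRowSketch
{
    // Joins the trimmed values of one row. string.Join writes the separator
    // only between values, so nothing follows the last one.
    public static string ToCsvLine(IEnumerable<string> values)
    {
        return string.Join(",", values.Select(v => (v ?? string.Empty).Trim()));
    }

    // Builds the CSV text from rows of already extracted cell values.
    public static string ToCsv(IEnumerable<IEnumerable<string>> rows)
    {
        var sb = new StringBuilder();
        foreach (var row in rows)
        {
            sb.Append(ToCsvLine(row)).Append('\n');
        }
        return sb.ToString().Trim();
    }
}
```

In the original loop this would mean adding each Value to a List<string> inside foreach (Cell _Cell in row) and appending the joined line once per row.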

Related

New line within CSV column causing issue

I have a large csv file which has millions of rows. The sample csv lines are
CODE,COMPANY NAME, DATE, ACTION
A,My Name , LLC,2018-01-28,BUY
B,Your Name , LLC,2018-01-25,SELL
C,
All Name , LLC,2018-01-21,SELL
D,World Name , LLC,2018-01-20,BUY
Row C contains a line break, but it is actually a single record. I want to remove the newline character inside the cell/field/column.
I tried \r\n, Environment.NewLine and many other things, but could not make it work.
Here is my code:
private DataTable CSToDataTable(string csvfile)
{
Int64 row = 0;
try
{
string CSVFilePathName = csvfile; //@"C:\test.csv";
string[] Lines = File.ReadAllLines(CSVFilePathName.Replace(Environment.NewLine, ""));
string[] Fields;
Fields = Lines[0].Split(new char[] { ',' });
int Cols = Fields.GetLength(0);
DataTable dt = new DataTable();
//1st row must be column names; force lower case to ensure matching later on.
for (int i = 0; i < Cols; i++)
dt.Columns.Add(Fields[i].ToLower(), typeof(string));
DataRow Row;
for (row = 1; row < Lines.GetLength(0); row++)
{
Fields = Lines[row].Split(new char[] { ',' });
Row = dt.NewRow();
//Console.WriteLine(row);
for (int f = 0; f < Cols; f++)
{
Row[f] = Fields[f];
}
dt.Rows.Add(Row);
if (row == 190063)
{
}
}
return dt;
}
catch (Exception ex)
{
throw ex;
}
}
How can I remove the newline character and read the row correctly? Per the business requirement, I cannot skip such rows.
Your CSV file is not in a valid format. In order to parse and load it successfully, you will have to sanitize it. There are a couple of issues:
The COMPANY NAME column contains the field separator. Fix this by surrounding the value with quotes.
A CSV value contains a new line. This can be fixed by combining adjacent rows into one.
With Cinchoo ETL, you can sanitize and load your large file as below:
string csv = @"CODE,COMPANY NAME, DATE, ACTION
A,My Name , LLC,2018-01-28,BUY
B,Your Name , LLC,2018-01-25,SELL
C,
All Name , LLC,2018-01-21,SELL
D,World Name , LLC,2018-01-20,BUY";
string bufferLine = null;
var reader = ChoCSVReader.LoadText(csv)
.WithFirstLineHeader()
.Setup(s => s.BeforeRecordLoad += (o, e) =>
{
string line = (string)e.Source;
string[] tokens = line.Split(",");
if (tokens.Length == 5)
{
//Fix the second and third value with quotes
e.Source = @"{0},""{1},{2}"",{3}, {4}".FormatString(tokens[0], tokens[1], tokens[2], tokens[3], tokens[4]);
}
else
{
//Fix the breaking lines, assume that some csv lines broken into max 2 lines
if (bufferLine == null)
{
bufferLine = line;
e.Skip = true;
}
else
{
line = bufferLine + line;
tokens = line.Split(",");
e.Source = @"{0},""{1},{2}"",{3}, {4}".FormatString(tokens[0], tokens[1], tokens[2], tokens[3], tokens[4]);
bufferLine = null; //Reset the buffer so the next broken record starts clean
}
}
});
foreach (var rec in reader)
Console.WriteLine(rec.Dump());
//Be careful when loading millions of rows into a DataTable
//var dt = reader.AsDataTable();
Hope it helps.
You haven't made it clear under what conditions an unwanted new line can appear in the file. Assuming that a 'proper' line in the CSV file does NOT end with a comma, and that a line ending with a comma is therefore incomplete, you could do something like this:
static void Main(string[] args)
{
string path = @"CSVFile.csv";
List<CSVData> data = new List<CSVData>();
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
{
using (StreamReader sr = new StreamReader(fs))
{
sr.ReadLine(); // Header
while (!sr.EndOfStream)
{
var line = sr.ReadLine();
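// Keep appending while the line is incomplete; stop at end of file so the loop cannot run forever
while (line.EndsWith(",") && !sr.EndOfStream)
{
line += sr.ReadLine();
}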
var items = line.Split(new string[] { "," }, StringSplitOptions.None);
data.Add(new CSVData() { CODE = items[0], NAME = items[1], COMPANY = items[2], DATE = items[3], ACTION = items[4] });
}
}
}
Console.ReadLine();
}
public class CSVData
{
public string CODE { get; set; }
public string NAME { get; set; }
public string COMPANY { get; set; }
public string DATE { get; set; }
public string ACTION { get; set; }
}
Obviously there's a lot of error handling to be done here (for example, when creating a new CSVData object make sure your items contain all the data you want), but I think this is the start you need.
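If a line can break anywhere, not only right after a comma, a slightly more general variant is to keep appending physical lines until the record contains the expected number of separators. The sketch below illustrates that idea under stated assumptions: every record has the same number of commas (4 in the sample data, because the company name itself contains one), the header line is read separately by the caller, and the names BrokenLineSketch and ReadRecords are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class BrokenLineSketch
{
    // Yields logical records, appending physical lines until the record
    // holds at least expectedSeparators commas.
    public static IEnumerable<string> ReadRecords(TextReader reader, int expectedSeparators)
    {
        string pending = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            pending = pending == null ? line : pending + line;
            if (pending.Count(c => c == ',') >= expectedSeparators)
            {
                yield return pending;
                pending = null;
            }
        }
        // Return an incomplete trailing fragment instead of dropping it silently.
        if (pending != null)
        {
            yield return pending;
        }
    }
}
```

Because the records are produced lazily from a TextReader, this can be fed from a StreamReader over the large file without loading all lines into memory first.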

How to skip headline in csv data when reading from StreamReader?

EDITED:
I have following code:
private void button1_Click_1(object sender, EventArgs e)
{
var date = new List<String>();
var value = new List<Double>();
string dir = @"C:\Main\test.csv";
using (var reader = new System.IO.StreamReader(dir))
{
var lines = File.ReadLines(dir)
.Skip(1);//Ignore the first line
foreach (var line in lines)
{
var fields = line.Split(new Char[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
date.Add(fields[0]);
if (fields.Length > 1)
value.Add(Convert.ToDouble(fields[1]));
}
String[] _date = date.ToArray();
Double[] _value = value.ToArray();
chart1.Series["Test"].Points.DataBindXY(_date,_value);
chart1.Series["Test"].ChartType = SeriesChartType.Spline;
}
}
Now I want to skip the header line of the CSV data, that is, the first row of the first column and the first row of the second column. How can I do that?
The headers are strings. When there are no headers, it skips the first row, but with headers I get a System.FormatException.
It fails when the first row contains Date in the first column and Value in the second column, like this (opened in a text editor):
"Date";"Value"
"20.04.2010";"82.6619508214314"
"21.04.2010";"33.2262968571519"
"22.04.2010";"25.0174973120814"
Why not just start by reading one line, and doing nothing with it?
using (var reader = new System.IO.StreamReader(dir))
{
reader.ReadLine(); // skip first
string line;
while ((line = reader.ReadLine()) != null)
{
}
}
Add one reader.ReadLine() before doing the while loop
using (var reader = new System.IO.StreamReader(dir))
{
if (reader.ReadLine() != null) //read first line; null means the file is empty
{
string line;
while ((line = reader.ReadLine()) != null) //read following lines
{
}
}
}
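Skipping the header may not be the whole story for the sample shown in the question: if the quotes are really part of the file, the numeric field still carries them when it is parsed, and the decimal point depends on the current culture. The sketch below shows one possible way to handle both, assuming the quoted, semicolon-separated layout from the sample; the names HeaderSkipSketch and Parse are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class HeaderSkipSketch
{
    // Parses lines of the form "date";"value", skipping the first (header) line.
    public static List<KeyValuePair<string, double>> Parse(IEnumerable<string> lines)
    {
        var points = new List<KeyValuePair<string, double>>();
        foreach (var line in lines.Skip(1))
        {
            var fields = line.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
            if (fields.Length < 2)
            {
                continue; // ignore blank or incomplete lines
            }
            string date = fields[0].Trim('"');
            // Strip the quotes before parsing; the invariant culture treats '.' as the decimal separator.
            double value = double.Parse(fields[1].Trim('"'), CultureInfo.InvariantCulture);
            points.Add(new KeyValuePair<string, double>(date, value));
        }
        return points;
    }
}
```

It could be called as HeaderSkipSketch.Parse(File.ReadLines(dir)), after which the keys and values can be copied into the arrays passed to DataBindXY.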

How to read tab delimited lines by skipping alternate lines

I am currently able to parse and extract data from a large tab-delimited file. I read, parse and extract line by line and add the split items to my DataTable (with a row limit, adding 3 rows at a time). I need to skip the even lines, i.e. read the first line that has the maximum number of columns, skip the second one, and go straight to the third.
My Tab delimited source file format
001Mean 26.975 1.1403 910.45
001Stdev 26.975 1.1403 910.45
002Mean 26.975 1.1403 910.45
002Stdev 26.975 1.1403 910.45
Need to skip or avoid reading Stdev tab delimited lines.
C# Code:
Getting the Maximum length of items in a tab delimited line of the file by splitting a line
using (var reader = new StreamReader(sourceFileFullName))
{
string line = null;
line = reader.ReadToEnd();
if (!string.IsNullOrEmpty(line))
{
var list_with_max_cols = line.Split('\n').OrderByDescending(y => y.Split('\t').Count()).Take(1);
foreach (var value in list_with_max_cols)
{
var values = value.ToString().Split(new[] { '\t', '\n' }).ToArray();
MAX_NO_OF_COLUMNS = values.Length;
}
}
}
Reading the file line by line until maximum length in a tab delimited line is satisfied as first line to parse and extract
using (var reader = new StreamReader(sourceFileFullName))
{
string new_read_line = null;
//Read and display lines from the file until the end of the file is reached.
while ((new_read_line = reader.ReadLine()) != null)
{
var items = new_read_line.Split(new[] { '\t', '\n' }).ToArray();
if (items.Length != MAX_NO_OF_COLUMNS)
continue;
//when reach first line it is column list need to create datatable based on that.
if (firstLineOfFile)
{
columnData = new_read_line;
firstLineOfFile = false;
continue;
}
if (firstLineOfChunk)
{
firstLineOfChunk = false;
chunkDataTable = CreateEmptyDataTable(columnData);
}
AddRow(chunkDataTable, new_read_line);
chunkRowCount++;
if (chunkRowCount == _chunkRowLimit)
{
firstLineOfChunk = true;
chunkRowCount = 0;
yield return chunkDataTable;
chunkDataTable = null;
}
}
}
Creating Data Table:
private DataTable CreateEmptyDataTable(string firstLine)
{
IList<string> columnList = Split(firstLine);
var dataTable = new DataTable("TableName");
for (int columnIndex = 0; columnIndex < columnList.Count; columnIndex++)
{
string c_string = columnList[columnIndex];
if (Regex.Match(c_string, "\\s").Success)
{
string tmp = Regex.Replace(c_string, "\\s", "");
string finaltmp = Regex.Replace(tmp, @" ?\[.*?\]", ""); // To strip strings inside [] and inclusive [] alone
columnList[columnIndex] = finaltmp;
}
}
dataTable.Columns.AddRange(columnList.Select(v => new DataColumn(v)).ToArray());
dataTable.Columns.Add("ID");
return dataTable;
}
How can I skip alternate lines while reading, then split the remaining lines and add them to my DataTable?
AddRow function: I managed to meet my requirement with the following change:
private void AddRow(DataTable dataTable, string line)
{
if (line.Contains("Stdev"))
{
return;
}
else
{
//Rest of Code
}
}
Considering you have tab separated values in each line, how about reading the odd lines and splitting them into arrays. This is just a sample; you can expand upon this.
Test data (file.txt)
luck is when opportunity meets preparation
this line needs to be skipped
microsoft visual studio
another line to be skipped
let us all code
Code
var oddLines = File.ReadLines(@"C:\projects\file.txt").Where((item, index) => index%2 == 0);
foreach (var line in oddLines)
{
var words = line.Split('\t');
}
EDIT
To get lines that don't contain 'Stdev'
var filteredLines = System.IO.File.ReadLines(@"C:\projects\file.txt").Where(item => !item.Contains("Stdev"));
Change
using (var reader = new StreamReader(sourceFileFullName))
{
string new_read_line = null;
//Read and display lines from the file until the end of the file is reached.
while ((new_read_line = reader.ReadLine()) != null)
{
var items = new_read_line.Split(new[] { '\t', '\n' }).ToArray();
if (items.Length != MAX_NO_OF_COLUMNS)
continue;
To
using (var reader = new StreamReader(sourceFileFullName))
{
int cnt = 0;
string new_read_line = null;
//Read and display lines from the file until the end of the file is reached.
while ((new_read_line = reader.ReadLine()) != null)
{
cnt++;
if(cnt % 2 == 0)
continue;
var items = new_read_line.Split(new[] { '\t', '\n' }).ToArray();
if (items.Length != MAX_NO_OF_COLUMNS)
continue;

Inserting text in Excel using Open Xml SDK and LINQ

I have been trying really hard, but my Excel sheet is not populated as I expect. If the content is of string data type, the sheet shows '0' in its place, no matter which conversions I try. I am pasting my code below in case anyone can help:
public static void WriteExcelDocument(string FilePath)
{
try
{
using (SpreadsheetDocument spreadSheet = SpreadsheetDocument.Open(FilePath, true))
{
WorkbookPart workbookPart = spreadSheet.WorkbookPart;
IEnumerable<Sheet> Sheets = spreadSheet.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>().Where(s => s.Name == "data");
if (Sheets.Count() == 0)
{
// The specified worksheet does not exist.
return;
}
string relationshipId = Sheets.First().Id.Value;
WorksheetPart worksheetPart = (WorksheetPart)spreadSheet.WorkbookPart.GetPartById(relationshipId);
SheetData sheetData = worksheetPart.Worksheet.GetFirstChild<SheetData>();
int index = 2;
SpectrumNewEntities context = new SpectrumNewEntities();
var q = from result in context.Appraisers
select result;
foreach (var g in q)
{
string Name = g.AppraiserName!=null?g.AppraiserName:String.Empty;
string city = g.City != null ? g.City : String.Empty;
string Address = g.Address != null ? g.Address : "NA";
int AppId = g.AppraiserAppraiserCompanyId != null ? (int)g.AppraiserAppraiserCompanyId : 0;
string email = g.Email != null ? g.Email : String.Empty;
Row contentRow = CreateContentRow(index, Name, city, Address, AppId,email);
index++;
sheetData.AppendChild(contentRow);
}
// Save the worksheet.
worksheetPart.Worksheet.Save();
}
}
catch (Exception)
{
throw;
}
}
private static Row CreateContentRow(int index, string Name, string city, string Address, int AppId, string email)
{
try
{
//Create new row
Row r = new Row();
r.RowIndex = (UInt32)index;
//First cell is a text cell, so create it and append it
Cell firstCell = CreateTextCell(headerColumns[0], Name, index);
r.AppendChild(firstCell);//
//create cells that contain data
for (int i = 1; i < headerColumns.Length; i++)
{
Cell c = new Cell();
c.CellReference = headerColumns[i] + index;
CellValue v = new CellValue();
if (i == 1)
{
v.Text = city.ToString();
}
if (i == 2)
{
v.Text = Address.ToString();
}
if (i == 3)
{
v.Text =AppId.ToString();
}
if (i == 4)
{
v.Text = email.ToString();
}
c.AppendChild(v);
r.AppendChild(c);
}
return r;
}
catch (Exception)
{
throw;
}
}
private static Cell CreateTextCell(string header, string Name,int index)
{
try
{
//Create a new inline string cell
Cell c = new Cell();
c.DataType = CellValues.InlineString;
c.CellReference = header + index;
//Add text to text cell
InlineString inlineString = new InlineString();
Text t = new Text();
t.Text = Name;
inlineString.AppendChild(t);
c.AppendChild(inlineString);
return c;
}
catch (Exception)
{
throw;
}
}
I don't understand why the sheet is displayed like this.
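Use the CreateTextCell method to add all the text cells to the row, as you are already doing for the Name field. A cell written without a DataType is treated as a number, so plain text placed in its CellValue is not shown as text.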

EPPlus Reading Column Headers

Is there an easy way to tell EPPlus that a row is a header? Or should I create the headers by specifying a range using SelectedRange, remove it from the sheet and iterate the cells that remain?
I ended up doing this:
class Program
{
static void Main(string[] args)
{
DirectoryInfo outputDir = new DirectoryInfo(@"C:\testdump\excelimports");
FileInfo existingFile = new FileInfo(outputDir.FullName + @"\Stormers.xlsx");
Dictionary<string, string> arrColumnNames = new Dictionary<string,string>() { { "First Name", "" }, { "Last Name", "" }, { "Email Address", "" } };
using (ExcelPackage package = new ExcelPackage(existingFile))
{
ExcelWorksheet sheet = package.Workbook.Worksheets[1];
var q = from cell in sheet.Cells
where arrColumnNames.ContainsKey(cell.Value.ToString())
select cell;
foreach (var c in q)
{
arrColumnNames[c.Value.ToString()] = c.Address;
}
foreach (var ck in arrColumnNames)
{
Console.WriteLine("{0} - {1}", ck.Key, ck.Value);
}
var qValues = from r in sheet.Cells
where !arrColumnNames.ContainsValue(r.Address.ToString())
select r;
foreach (var r in qValues)
{
Console.WriteLine("{0} - {1}", r.Address, r.Value);
}
}
}
}
I needed to enumerate the header and display all the column headers to my end user. I took Muhammad Mubashir's code as a base, converted it to an extension method and removed the hard-coded numbers from it.
public static class ExcelWorksheetExtension
{
public static string[] GetHeaderColumns(this ExcelWorksheet sheet)
{
List<string> columnNames = new List<string>();
foreach (var firstRowCell in sheet.Cells[sheet.Dimension.Start.Row, sheet.Dimension.Start.Column, sheet.Dimension.Start.Row, sheet.Dimension.End.Column])
columnNames.Add(firstRowCell.Text);
return columnNames.ToArray();
}
}
var pck = new OfficeOpenXml.ExcelPackage();
pck.Load(new System.IO.FileInfo(path).OpenRead());
var ws = pck.Workbook.Worksheets["Worksheet1"];
DataTable tbl = new DataTable();
var hasHeader = true;
foreach (var firstRowCell in ws.Cells[1, 1, 1, ws.Dimension.End.Column]){
tbl.Columns.Add(hasHeader ? firstRowCell.Text : string.Format("Column {0}", firstRowCell.Start.Column));
}
var startRow = hasHeader ? 2 : 1;
for (var rowNum = startRow; rowNum <= ws.Dimension.End.Row; rowNum++){
var wsRow = ws.Cells[rowNum, 1, rowNum, ws.Dimension.End.Column];
var row = tbl.NewRow();
foreach (var cell in wsRow){
row[cell.Start.Column - 1] = cell.Text;
}
tbl.Rows.Add(row);
}
I had a similar issue. Here's some code that may help:
using (var package = new ExcelPackage(fileStream))
{
// Get the workbook in the file
var workbook = package.Workbook;
if (workbook != null && workbook.Worksheets.Any())
{
// Get the first worksheet
var sheet = workbook.Worksheets.First();
// Get header values
var column1Header = sheet.Cells["A1"].GetValue<string>();
var column2Header = sheet.Cells["B1"].GetValue<string>();
// "A2:A" means "starting from A2 (1st col, 2nd row),
// get me all populated cells in Column A" (yes, unusual range syntax)
var firstColumnRows = sheet.Cells["A2:A"];
// Loop through rows in the first column, get values based on offset
foreach (var cell in firstColumnRows)
{
var column1CellValue = cell.GetValue<string>();
var column2CellValue = cell.Offset(0, 1).GetValue<string>();
}
}
}
If anyone knows of a more elegant way than cell.Offset, let me know.
I just took ndd's code and converted it to use System.Linq.
using System.Linq;
using OfficeOpenXml;
namespace Project.Extensions.Excel
{
public static class ExcelWorksheetExtension
{
/// <summary>
/// Get Header row with EPPlus.
/// <a href="https://stackoverflow.com/questions/10278101/epplus-reading-column-headers">
/// EPPlus Reading Column Headers
/// </a>
/// </summary>
/// <param name="sheet"></param>
/// <returns>Array of headers</returns>
public static string[] GetHeaderColumns(this ExcelWorksheet sheet)
{
return sheet.Cells[sheet.Dimension.Start.Row, sheet.Dimension.Start.Column, sheet.Dimension.Start.Row, sheet.Dimension.End.Column]
.Select(firstRowCell => firstRowCell.Text).ToArray();
}
}
}
