How to read faster in OpenXML format

How to read faster in OpenXML format - c#

When I used OLEDB, it takes only 2 - 3 seconds to read 3200 rows from an Excel Sheet. Now I changed to OpenXML format and now it takes more than 1 minute to read 3200 rows from an Excel Sheet.
Below is my code:
public static DataTable ReadExcelFileDOM(string filename)
{
DataTable table;
using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true))
{
WorkbookPart workbookPart = myDoc.WorkbookPart;
Sheet worksheet = workbookPart.Workbook.Descendants<Sheet>().First();
WorksheetPart worksheetPart =
(WorksheetPart)(workbookPart.GetPartById(worksheet.Id));
SheetData sheetData =
worksheetPart.Worksheet.Elements<SheetData>().First();
List<List<string>> totalRows = new List<List<string>>();
int maxCol = 0;
foreach (Row r in sheetData.Elements<Row>())
{
// Add the empty row.
string value = null;
while (totalRows.Count < r.RowIndex - 1)
{
List<string> emptyRowValues = new List<string>();
for (int i = 0; i < maxCol; i++)
{
emptyRowValues.Add("");
}
totalRows.Add(emptyRowValues);
}
List<string> tempRowValues = new List<string>();
foreach (Cell c in r.Elements<Cell>())
{
#region get the cell value of c.
if (c != null)
{
value = c.InnerText;
// If the cell represents a numeric value, you are done.
// For dates, this code returns the serialized value that
// represents the date. The code handles strings and Booleans
// individually. For shared strings, the code looks up the
// corresponding value in the shared string table. For Booleans,
// the code converts the value into the words TRUE or FALSE.
if (c.DataType != null)
{
switch (c.DataType.Value)
{
case CellValues.SharedString:
// For shared strings, look up the value in the shared
// strings table.
var stringTable = workbookPart.
GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
// If the shared string table is missing, something is
// wrong. Return the index that you found in the cell.
// Otherwise, look up the correct text in the table.
if (stringTable != null)
{
value = stringTable.SharedStringTable.
ElementAt(int.Parse(value)).InnerText;
}
break;
case CellValues.Boolean:
switch (value)
{
case "0":
value = "FALSE";
break;
default:
value = "TRUE";
break;
}
break;
}
}
Console.Write(value + " ");
}
#endregion
// Add the cell to the row list.
int i = Convert.ToInt32(c.CellReference.ToString().ToCharArray().First() - 'A');
// Add the blank cell in the row.
while (tempRowValues.Count < i)
{
tempRowValues.Add("");
}
tempRowValues.Add(value);
}
// add the row to the totalRows.
maxCol = processList(tempRowValues, totalRows, maxCol);
Console.WriteLine();
}
table = ConvertListListStringToDataTable(totalRows, maxCol);
}
return table;
}
/// <summary>
/// Add each row to the totalRows.
/// </summary>
/// <param name="tempRows"></param>
/// <param name="totalRows"></param>
/// <param name="MaxCol">the max column number in rows of the totalRows</param>
/// <returns></returns>
private static int processList(List<string> tempRows, List<List<string>> totalRows, int MaxCol)
{
if (tempRows.Count > MaxCol)
{
MaxCol = tempRows.Count;
}
totalRows.Add(tempRows);
return MaxCol;
}
private static DataTable ConvertListListStringToDataTable(List<List<string>> totalRows, int maxCol)
{
DataTable table = new DataTable();
for (int i = 0; i < maxCol; i++)
{
table.Columns.Add();
}
foreach (List<string> row in totalRows)
{
while (row.Count < maxCol)
{
row.Add("");
}
table.Rows.Add(row.ToArray());
}
return table;
}
Is there an efficient way to change this code somewhere so that the read process can be little faster. How I can change this to code to read faster. Thanks.

I tried your code and noted that in an very easy example it took me approximately 4 secs to complete.
After editing my .xls file to your given details (columns: regional prefix, city, date, function, ...) and adding about 3,600 rows your code comes up to about 10 secs.
I think you should remove any Console.WriteLine statements as these ones slow down processing your xls file. After removing all of those my StopWatch showed 1.26 secs for the same number of rows.
You can find some reasons why console.WriteLine is so slow even on SO: Console.WriteLine slow. In this question there is an answer pointing to OutputDebugString...

I found some disadvantages in you code.
When add to DataTable large number of rows use BeginLoadData and EndLoadData
You need cache SharedStringTable
You should use OpenXmlReader (SAX method). Memory consumption will be reduced.
You can try my ExcelDataReader without these disadvantages. See here https://github.com/gSerP1983/OpenXml.Excel.Data
Read to DataTable example:
class Program
{
static void Main(string[] args)
{
var dt = new DataTable();
using (var reader = new ExcelDataReader(#"data.xlsx"))
{
dt.Load(reader);
}
Console.WriteLine("done: " + dt.Rows.Count);
Console.ReadKey();
}
}

Related

Reading excel file in c# using Microsoft DocumentFormat.OpenXml SDK

I am using Microsoft DocumentFormat.OpenXml SDK to read data from excel file.
While doing so I am taking into consideration if a cell has blank values(If Yes, read that too).
Now, facing issues with one of the excel sheets where the workSheet.SheetDimension is null hence the code is throwing an exception.
Code used :
class OpenXMLHelper
{
// A helper function to open an Excel file using OpenXML, and return a DataTable containing all the data from one
// of the worksheets.
//
// We've had lots of problems reading in Excel data using OLEDB (eg the ACE drivers no longer being present on new servers,
// OLEDB not working due to security issues, and blatantly ignoring blank rows at the top of worksheets), so this is a more
// stable method of reading in the data.
//
public static DataTable ExcelWorksheetToDataTable(string pathFilename)
{
try
{
DataTable dt = new DataTable();
string dimensions = string.Empty;
using (FileStream fs = new FileStream(pathFilename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
using (SpreadsheetDocument document = SpreadsheetDocument.Open(fs, false))
{
// Find the sheet with the supplied name, and then use that
// Sheet object to retrieve a reference to the first worksheet.
//Sheet theSheet = document.WorkbookPart.Workbook.Descendants<Sheet>().Where(s => s.Name == worksheetName).FirstOrDefault();
//--Sheet theSheet = document.WorkbookPart.Workbook.Descendants<Sheet>().FirstOrDefault();
//--if (theSheet == null)
//-- throw new Exception("Couldn't find the worksheet: "+ theSheet.Id);
// Retrieve a reference to the worksheet part.
//WorksheetPart wsPart = (WorksheetPart)(document.WorkbookPart.GetPartById(theSheet.Id));
//--WorksheetPart wsPart = (WorksheetPart)(document.WorkbookPart.GetPartById(theSheet.Id));
WorkbookPart workbookPart = document.WorkbookPart;
WorksheetPart wsPart = workbookPart.WorksheetParts.FirstOrDefault();
Worksheet workSheet = wsPart.Worksheet;
dimensions = workSheet.SheetDimension.Reference.InnerText; // Get the dimensions of this worksheet, eg "B2:F4"
int numOfColumns = 0;
int numOfRows = 0;
CalculateDataTableSize(dimensions, ref numOfColumns, ref numOfRows);
//System.Diagnostics.Trace.WriteLine(string.Format("The worksheet \"{0}\" has dimensions \"{1}\", so we need a DataTable of size {2}x{3}.", worksheetName, dimensions, numOfColumns, numOfRows));
SheetData sheetData = workSheet.GetFirstChild<SheetData>();
IEnumerable<Row> rows = sheetData.Descendants<Row>();
string[,] cellValues = new string[numOfColumns, numOfRows];
int colInx = 0;
int rowInx = 0;
string value = "";
SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;
// Iterate through each row of OpenXML data, and store each cell's value in the appropriate slot in our [,] string array.
foreach (Row row in rows)
{
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
// *DON'T* assume there's going to be one XML element for each column in each row...
Cell cell = row.Descendants<Cell>().ElementAt(i);
if (cell.CellValue == null || cell.CellReference == null)
continue; // eg when an Excel cell contains a blank string
// Convert this Excel cell's CellAddress into a 0-based offset into our array (eg "G13" -> [6, 12])
colInx = GetColumnIndexByName(cell.CellReference); // eg "C" -> 2 (0-based)
rowInx = GetRowIndexFromCellAddress(cell.CellReference) - 1; // Needs to be 0-based
// Fetch the value in this cell
value = cell.CellValue.InnerXml;
if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
{
value = stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
}
cellValues[colInx, rowInx] = value;
}
}
// Copy the array of strings into a DataTable.
// We don't (currently) make any attempt to work out which columns should be numeric, rather than string.
for (int col = 0; col < numOfColumns; col++)
{
//dt.Columns.Add("Column_" + col.ToString());
dt.Columns.Add(cellValues[col, 0]);
}
//foreach (Cell cell in rows.ElementAt(0))
//{
// dt.Columns.Add(GetCellValue(doc, cell));
//}
for (int row = 0; row < numOfRows; row++)
{
DataRow dataRow = dt.NewRow();
for (int col = 0; col < numOfColumns; col++)
{
dataRow.SetField(col, cellValues[col, row]);
}
dt.Rows.Add(dataRow);
}
dt.Rows.RemoveAt(0);
//#if DEBUG
// // Write out the contents of our DataTable to the Output window (for debugging)
// string str = "";
// for (rowInx = 0; rowInx < maxNumOfRows; rowInx++)
// {
// for (colInx = 0; colInx < maxNumOfColumns; colInx++)
// {
// object val = dt.Rows[rowInx].ItemArray[colInx];
// str += (val == null) ? "" : val.ToString();
// str += "\t";
// }
// str += "\n";
// }
// System.Diagnostics.Trace.WriteLine(str);
//#endif
return dt;
}
}
}
catch (Exception ex)
{
return null;
}
}
public static void CalculateDataTableSize(string dimensions, ref int numOfColumns, ref int numOfRows)
{
// How many columns & rows of data does this Worksheet contain ?
// We'll read in the Dimensions string from the Excel file, and calculate the size based on that.
// eg "B1:F4" -> we'll need 6 columns and 4 rows.
//
// (We deliberately ignore the top-left cell address, and just use the bottom-right cell address.)
try
{
string[] parts = dimensions.Split(':'); // eg "B1:F4"
if (parts.Length != 2)
throw new Exception("Couldn't find exactly *two* CellAddresses in the dimension");
numOfColumns = 1 + GetColumnIndexByName(parts[1]); // A=1, B=2, C=3 (1-based value), so F4 would return 6 columns
numOfRows = GetRowIndexFromCellAddress(parts[1]);
}
catch
{
throw new Exception("Could not calculate maximum DataTable size from the worksheet dimension: " + dimensions);
}
}
public static int GetRowIndexFromCellAddress(string cellAddress)
{
// Convert an Excel CellReference column into a 1-based row index
// eg "D42" -> 42
// "F123" -> 123
string rowNumber = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^0-9 _]", "");
return int.Parse(rowNumber);
}
public static int GetColumnIndexByName(string cellAddress)
{
// Convert an Excel CellReference column into a 0-based column index
// eg "D42" -> 3
// "F123" -> 5
var columnName = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^A-Z_]", "");
int number = 0, pow = 1;
for (int i = columnName.Length - 1; i >= 0; i--)
{
number += (columnName[i] - 'A' + 1) * pow;
pow *= 26;
}
return number - 1;
}
}[enter image description here][1]

The SheetDimension part is optional (and therefor you cannot always rely on it being up to date). See the following part of the OpenXML specification:
18.3.1.35 dimension (Worksheet Dimensions)
This element specifies the used range of the worksheet. It specifies the row and column bounds of
used cells in the worksheet. This is optional and is not required.
Used cells include cells with formulas, text content, and cell
formatting. When an entire column is formatted, only the first cell in
that column is considered used.
So an Excel file without any SheetDimension part is perfectly valid, so you should not rely on it being present in an Excel file.
Therefor I'd suggest to simply parse all Row elements contained in the SheetData part, and "count" the number of rows (instead of reading the SheetDimensions part to get the number of rows / columns). This way you can also take into account that an Excel file may contain completely blank rows in-between the data.

How to use Epplus with cells containing few rows

I want to import some excel file using epplus
the problem is that some cells contains more than one row (and that cause a problem
My excel look like this (in realite their is more tests (test2,test3....)
I can only get the first column by this algorithm..but it will be more complicated to get the seconde column
//this is the list than contain applications (column 2)
ICollection<Application> applications = new List<Application>();
int i = 0;
for (int j = workSheet.Dimension.Start.Row;
j <= workSheet.Dimension.End.Row;
j=i+1)
{
//this is the object that contain the first column
//and also a list of the second column (foreach domain thei `is a list of applications (column 2)`
Domaine domaine = new Domaine();
i += 1;
//add here and not last row
while (workSheet.Cells[i, 1].Text == "" && i < workSheet.Dimension.End.Row)
{
i++;
}
if (i > workSheet.Dimension.End.Row)
break;
domaine.NomDomaine = workSheet.Cells[i, 1].Text;
domaines.Add(domaine);
}
Edit : in other words is their a way to get the number of rows in one cell , OR a way to duplicate the value of each row in the cell
(for exemple if i have a cell from row 1 to 14 and the row number 5 have value)
how can i duplicate that text to all the rows (that will help me solving the problem)

Those are known as Merged cells. Values from merged cells are stored in the .Value property of the first cell in the merged range. This means we need to do just a little bit more work in order to read the value from a merged cell using EPPlus.
EPPlus provides us with a couple of properties that help us get to the correct reference though. Firstly we can use a cell's .Merge property to find out if it is part of a merged range. Then we can use the the worksheet's .MergedCells property to find the relevant range. It's then just a matter of finding the first cell in that range and returning the value.
So, in summary:
Determine if the cell we need to read from is part of a merged range using .Merge
If so, get the index of the merged range using the worksheet's .MergedCells property
Read the value from the first cell in the merged range
Putting this together we can derive a little helper method to take a worksheet object and row/col indices in order to return the value:
static string GetCellValueFromPossiblyMergedCell(ExcelWorksheet wks, int row, int col)
{
var cell = wks.Cells[row, col];
if (cell.Merge) //(1.)
{
var mergedId = wks.MergedCells[row, col]; //(2.)
return wks.Cells[mergedId].First().Value.ToString(); //(3.)
}
else
{
return cell.Value.ToString();
}
}
Worked example
If I have a domain class like this:
class ImportedRecord
{
public string ChildName { get; set; }
public string SubGroupName { get; set; }
public string GroupName { get; set; }
}
that I wanted to read from a spreadsheet that looked like this:
Then I could use this method:
static List<ImportedRecord> ImportRecords()
{
var ret = new List<ImportedRecord>();
var fInfo = new FileInfo(#"C:\temp\book1.xlsx");
using (var excel = new ExcelPackage(fInfo))
{
var wks = excel.Workbook.Worksheets["Sheet1"];
var lastRow = wks.Dimension.End.Row;
for (int i = 2; i <= lastRow; i++)
{
var importedRecord = new ImportedRecord
{
ChildName = wks.Cells[i, 4].Value.ToString(),
SubGroupName = GetCellValueFromPossiblyMergedCell(wks,i,3),
GroupName = GetCellValueFromPossiblyMergedCell(wks, i, 2)
};
ret.Add(importedRecord);
}
}
return ret;
}
static string GetCellValueFromPossiblyMergedCell(ExcelWorksheet wks, int row, int col)
{
var cell = wks.Cells[row, col];
if (cell.Merge)
{
var mergedId = wks.MergedCells[row, col];
return wks.Cells[mergedId].First().Value.ToString();
}
else
{
return cell.Value.ToString();
}
}

Update all objects where create new one

I have a Languages class in which i use the excellibrary. I have an .xls file in which i have three columns. The first is used to check if the key phrase is used in the document and then i have one column for each language i use. I would like to create a transactionField object for every row of the document. I try to do that but every time i create a new object all the objects that was created before take the values of the last object created. Camn you please explain me where i am wrong and how can i correct that issue?
This is where the mistake happen
TranslationField tnf = new TranslationField();
tnf.Used = false;
tnf.Strings = values;
Translations.Add(sKey, tnf);
public class Languages
{
public static bool Setup()
{
SupportedLanguages.Clear();
SupportedLanguages.Add(csDefaultLang);
try
{
Workbook book = Workbook.Load(sPath);
Worksheet sheet = book.Worksheets[0];
KeyStringHelper values = new KeyStringHelper();
TranslationNeedle tnl;
List<string> columns = new List<string>();
string sKey = "";
// traverse rows by Index
for (int rowIndex = sheet.Cells.FirstRowIndex; rowIndex <= sheet.Cells.LastRowIndex; rowIndex++)
{
Row row = sheet.Cells.GetRow(rowIndex);
row.FirstColIndex = 1;
for (int colIndex = row.FirstColIndex; colIndex <= row.LastColIndex; colIndex++)
{
Cell cell = row.GetCell(colIndex);
// the first excel row is assumed to be columns names
if (rowIndex == sheet.Cells.FirstRowIndex)
{
//Columns names correctly formatted
columns.Add(char.ToUpper(cell.StringValue[0]) + cell.StringValue.Substring(1).ToUpper());
//Register every language inside the xls
SupportedLanguages.Add(char.ToUpper(cell.StringValue[0]) + cell.StringValue.Substring(1).ToUpper());
}
else
{
if (colIndex - row.FirstColIndex == 0)
sKey = cell.StringValue.Replace("\r\n", "\n");
else
values.Add(columns[colIndex - row.FirstColIndex], cell.StringValue.Replace("\r\n", "\n"));
}
}
// add the cell values to Translations Dictionary
if (rowIndex != sheet.Cells.FirstRowIndex)
{
TranslationField tnf = new TranslationField();
tnf.Used = false;
tnf.Strings = values;
Translations.Add(sKey, tnf);
}
}
//other stuff
}
}
Here is the class TranslationField
class TranslationField
{
public bool Used = false;
public KeyStringHelper Strings = new KeyStringHelper();
}

You're reusing the same KeyStringHelper instance (values) for every TranslationField. So every TranslationField instance in your Translations collection is referencing the same KeyStringHelper instance.
It looks like you need to move the line
KeyStringHelper values = new KeyStringHelper();
inside the outer for loop.

C# OPEN XML: empty cells are getting skipped while getting data from EXCEL to DATATABLE

Task
Import data from excel to DataTable
Problem
The cell that doesnot contain any data are getting skipped and the very next cell that has data in the row is used as the value of the empty colum.
E.g
A1 is empty A2 has a value Tom then while importing the data A1 get the value of A2 and A2 remains empty
To make it very clear I am providing some screen shots below
This is the excel data
This is the DataTable after importing the data from excel
Code
public class ImportExcelOpenXml
{
public static DataTable Fill_dataTable(string fileName)
{
DataTable dt = new DataTable();
using (SpreadsheetDocument spreadSheetDocument = SpreadsheetDocument.Open(fileName, false))
{
WorkbookPart workbookPart = spreadSheetDocument.WorkbookPart;
IEnumerable<Sheet> sheets = spreadSheetDocument.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>();
string relationshipId = sheets.First().Id.Value;
WorksheetPart worksheetPart = (WorksheetPart)spreadSheetDocument.WorkbookPart.GetPartById(relationshipId);
Worksheet workSheet = worksheetPart.Worksheet;
SheetData sheetData = workSheet.GetFirstChild<SheetData>();
IEnumerable<Row> rows = sheetData.Descendants<Row>();
foreach (Cell cell in rows.ElementAt(0))
{
dt.Columns.Add(GetCellValue(spreadSheetDocument, cell));
}
foreach (Row row in rows) //this will also include your header row...
{
DataRow tempRow = dt.NewRow();
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
tempRow[i] = GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
}
dt.Rows.Add(tempRow);
}
}
dt.Rows.RemoveAt(0); //...so i'm taking it out here.
return dt;
}
public static string GetCellValue(SpreadsheetDocument document, Cell cell)
{
SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;
string value = cell.CellValue.InnerXml;
if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
{
return stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
}
else
{
return value;
}
}
}
My Thoughts
I think there is some problem with
public IEnumerable<T> Descendants<T>() where T : OpenXmlElement;
In case I want the count of columns using Descendants
IEnumerable<Row> rows = sheetData.Descendants<<Row>();
int colCnt = rows.ElementAt(0).Count();
OR
If I am getting the count of rows using Descendants
IEnumerable<Row> rows = sheetData.Descendants<<Row>();
int rowCnt = rows.Count();`
In both cases Descendants is skipping the empty cells
Is there any alternative of Descendants.
Your suggestions are highly appreciated
P.S: I have also thought of getting the cells values by using column names like A1, A2 but in order to do that I will have to get the exact count of columns and rows which is not possible by using Descendants function.

Had there been some data in all the cells of a row then everything works fine. But if you happen to have even single empty cell in a row then things go haywire.
Why it is happening in first place?
The reason lies in below line of code:
row.Descendants<Cell>().Count()
Count() function gives you the number of non-empty cells in the row i.e. it will ignore all the empty cells while returning the count. So, when you pass row.Descendants<Cell>().ElementAt(i) as argument to GetCellValue method like this:
GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
Then, it will find the content of the next non-empty cell, not necessarily the content of the cell at column index i e.g. if the first column is empty and we call ElementAt(1), it returns the value in the second column instead and our program logic gets messed up.
Solution: We need to deal with the occurrence of empty cells in the row i.e. we need to figure out the actual/effective column index of the target cell in case there were some empty cells before it in the given row. So, you need to substitute your for loop code below:
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
tempRow[i] = GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
}
with
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
Cell cell = row.Descendants<Cell>().ElementAt(i);
int actualCellIndex = CellReferenceToIndex(cell);
tempRow[actualCellIndex] = GetCellValue(spreadSheetDocument, cell);
}
Also, add below method in your code which is used in the above modified code snippet to obtain the actual/effective column index of any cell:
private static int CellReferenceToIndex(Cell cell)
{
int index = 0;
string reference = cell.CellReference.ToString().ToUpper();
foreach (char ch in reference)
{
if (Char.IsLetter(ch))
{
int value = (int)ch - (int)'A';
index = (index == 0) ? value : ((index + 1) * 26) + value;
}
else
{
return index;
}
}
return index;
}
Note: Index in an Excel row start with 1 unlike various programming languages where it starts at 0.

public void Read2007Xlsx()
{
try
{
DataTable dt = new DataTable();
using (SpreadsheetDocument spreadSheetDocument = SpreadsheetDocument.Open(#"D:\File.xlsx", false))
{
WorkbookPart workbookPart = spreadSheetDocument.WorkbookPart;
IEnumerable<Sheet> sheets = spreadSheetDocument.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>();
string relationshipId = sheets.First().Id.Value;
WorksheetPart worksheetPart = (WorksheetPart)spreadSheetDocument.WorkbookPart.GetPartById(relationshipId);
Worksheet workSheet = worksheetPart.Worksheet;
SheetData sheetData = workSheet.GetFirstChild<SheetData>();
IEnumerable<Row> rows = sheetData.Descendants<Row>();
foreach (Cell cell in rows.ElementAt(0))
{
dt.Columns.Add(GetCellValue(spreadSheetDocument, cell));
}
foreach (Row row in rows) //this will also include your header row...
{
DataRow tempRow = dt.NewRow();
int columnIndex = 0;
foreach (Cell cell in row.Descendants<Cell>())
{
// Gets the column index of the cell with data
int cellColumnIndex = (int)GetColumnIndexFromName(GetColumnName(cell.CellReference));
cellColumnIndex--; //zero based index
if (columnIndex < cellColumnIndex)
{
do
{
tempRow[columnIndex] = ""; //Insert blank data here;
columnIndex++;
}
while (columnIndex < cellColumnIndex);
}//end if block
tempRow[columnIndex] = GetCellValue(spreadSheetDocument, cell);
columnIndex++;
}//end inner foreach loop
dt.Rows.Add(tempRow);
}//end outer foreach loop
}//end using block
dt.Rows.RemoveAt(0); //...so i'm taking it out here.
}//end try
catch (Exception ex)
{
}
}//end Read2007Xlsx method
/// <summary>
/// Given a cell name, parses the specified cell to get the column name.
/// </summary>
/// <param name="cellReference">Address of the cell (ie. B2)</param>
/// <returns>Column Name (ie. B)</returns>
public static string GetColumnName(string cellReference)
{
// Create a regular expression to match the column name portion of the cell name.
Regex regex = new Regex("[A-Za-z]+");
Match match = regex.Match(cellReference);
return match.Value;
} //end GetColumnName method
/// <summary>
/// Given just the column name (no row index), it will return the zero based column index.
/// Note: This method will only handle columns with a length of up to two (ie. A to Z and AA to ZZ).
/// A length of three can be implemented when needed.
/// </summary>
/// <param name="columnName">Column Name (ie. A or AB)</param>
/// <returns>Zero based index if the conversion was successful; otherwise null</returns>
public static int? GetColumnIndexFromName(string columnName)
{
//return columnIndex;
string name = columnName;
int number = 0;
int pow = 1;
for (int i = name.Length - 1; i >= 0; i--)
{
number += (name[i] - 'A' + 1) * pow;
pow *= 26;
}
return number;
} //end GetColumnIndexFromName method
public static string GetCellValue(SpreadsheetDocument document, Cell cell)
{
SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;
if (cell.CellValue ==null)
{
return "";
}
string value = cell.CellValue.InnerXml;
if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
{
return stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
}
else
{
return value;
}
}//end GetCellValue method

foreach (Cell cell in row.Descendants<Cell>())
{
while (columnRef[i] + (dt.Rows.Count + 1) != cell.CellReference)
{
dt.Rows[dt.Rows.Count - 1][i] = "";
i += 1;
}
dt.Rows[dt.Rows.Count - 1][i] = GetValue(doc, cell);
i++;
}

Try this code. I have done little modifications and it worked for me:
public static DataTable Fill_dataTable(string filePath)
{
DataTable dt = new DataTable();
using (SpreadsheetDocument doc = SpreadsheetDocument.Open(filePath, false))
{
Sheet sheet = doc.WorkbookPart.Workbook.Sheets.GetFirstChild<Sheet>();
Worksheet worksheet = doc.WorkbookPart.GetPartById(sheet.Id.Value) as WorksheetPart.Worksheet;
IEnumerable<Row> rows = worksheet.GetFirstChild<SheetData>().Descendants<Row>();
DataTable dt = new DataTable();
List<string> columnRef = new List<string>();
foreach (Row row in rows)
{
if (row.RowIndex != null)
{
if (row.RowIndex.Value == 1)
{
foreach (Cell cell in row.Descendants<Cell>())
{
dt.Columns.Add(GetValue(doc, cell));
columnRef.Add(cell.CellReference.ToString().Substring(0, cell.CellReference.ToString().Length - 1));
}
}
else
{
dt.Rows.Add();
int i = 0;
foreach (Cell cell in row.Descendants<Cell>())
{
while (columnRef(i) + dt.Rows.Count + 1 != cell.CellReference)
{
dt.Rows(dt.Rows.Count - 1)(i) = "";
i += 1;
}
dt.Rows(dt.Rows.Count - 1)(i) = GetValue(doc, cell);
i += 1;
}
}
}
}
}
return dt;
}

Having trouble reading excel file with the OpenXML sdk

I have a function that reads from an excel file and stores the results in a DataSet. I have another function that writes to an excel file. When I try to read from a regular human-generated excel file, the excel reading function returns a blank DataSet, but when I read from the excel file generated by the writing function, it works perfectly fine. The function then will not work on a regular generated excel file, even when I just copy and paste the contents of the function generated excel file. I finally tracked it down to this, but I have no idea where to go from here. Is there something wrong with my code?
Here is the excel generating function:
public static Boolean writeToExcel(string fileName, DataSet data)
{
Boolean answer = false;
using (SpreadsheetDocument excelDoc = SpreadsheetDocument.Create(tempPath + fileName, SpreadsheetDocumentType.Workbook))
{
WorkbookPart workbookPart = excelDoc.AddWorkbookPart();
workbookPart.Workbook = new Workbook();
WorksheetPart worksheetPart = workbookPart.AddNewPart<WorksheetPart>();
Sheets sheets = excelDoc.WorkbookPart.Workbook.AppendChild<Sheets>(new Sheets());
Sheet sheet = new Sheet()
{
Id = excelDoc.WorkbookPart.GetIdOfPart(worksheetPart),
SheetId = 1,
Name = "Page1"
};
sheets.Append(sheet);
CreateWorkSheet(worksheetPart, data);
answer = true;
}
return answer;
}
private static void CreateWorkSheet(WorksheetPart worksheetPart, DataSet data)
{
Worksheet worksheet = new Worksheet();
SheetData sheetData = new SheetData();
UInt32Value currRowIndex = 1U;
int colIndex = 0;
Row excelRow;
DataTable table = data.Tables[0];
for (int rowIndex = -1; rowIndex < table.Rows.Count; rowIndex++)
{
excelRow = new Row();
excelRow.RowIndex = currRowIndex++;
for (colIndex = 0; colIndex < table.Columns.Count; colIndex++)
{
Cell cell = new Cell()
{
CellReference = Convert.ToString(Convert.ToChar(65 + colIndex)),
DataType = CellValues.String
};
CellValue cellValue = new CellValue();
if (rowIndex == -1)
{
cellValue.Text = table.Columns[colIndex].ColumnName.ToString();
}
else
{
cellValue.Text = (table.Rows[rowIndex].ItemArray[colIndex].ToString() != "") ? table.Rows[rowIndex].ItemArray[colIndex].ToString() : "*";
}
cell.Append(cellValue);
excelRow.Append(cell);
}
sheetData.Append(excelRow);
}
SheetFormatProperties formattingProps = new SheetFormatProperties()
{
DefaultColumnWidth = 20D,
DefaultRowHeight = 20D
};
worksheet.Append(formattingProps);
worksheet.Append(sheetData);
worksheetPart.Worksheet = worksheet;
}
while the reading function is as following:
public static void readInventoryExcel(string fileName, ref DataSet set)
{
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(fileName, false))
{
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
int count = -1;
foreach (Row r in sheetData.Elements<Row>())
{
if (count >= 0)
{
DataRow row = set.Tables[0].NewRow();
row["SerialNumber"] = r.ChildElements[1].InnerXml;
row["PartNumber"] = r.ChildElements[2].InnerXml;
row["EntryDate"] = r.ChildElements[3].InnerXml;
row["RetirementDate"] = r.ChildElements[4].InnerXml;
row["ReasonForReplacement"] = r.ChildElements[5].InnerXml;
row["RetirementTech"] = r.ChildElements[6].InnerXml;
row["IncludeInMaintenance"] = r.ChildElements[7].InnerXml;
row["MaintenanceTech"] = r.ChildElements[8].InnerXml;
row["Comment"] = r.ChildElements[9].InnerXml;
row["Station"] = r.ChildElements[10].InnerXml;
row["LocationStatus"] = r.ChildElements[11].InnerXml;
row["AssetName"] = r.ChildElements[12].InnerXml;
row["InventoryType"] = r.ChildElements[13].InnerXml;
row["Description"] = r.ChildElements[14].InnerXml;
set.Tables[0].Rows.Add(row);
}
count++;
}
}
}

I think this is caused by the fact that you have only one sheet whereas Excel has three. I'm not certain but I think the sheets are returned in reverse order so you should change the line:
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
to
WorksheetPart worksheetPart = workbookPart.WorksheetParts.Last();
It might be safer to search for the WorksheetPart if you can identify it by the sheet name. You need to find the Sheet first then use the Id of that to find the SheetPart:
private WorksheetPart GetWorksheetPartBySheetName(WorkbookPart workbookPart, string sheetName)
{
//find the sheet first.
IEnumerable<Sheet> sheets = workbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>().Where(s => s.Name == sheetName);
if (sheets.Count() > 0)
{
string relationshipId = sheets.First().Id.Value;
WorksheetPart worksheetPart = (WorksheetPart)workbookPart.GetPartById(relationshipId);
return worksheetPart;
}
return null;
}
You can then use:
WorksheetPart worksheetPart = GetWorksheetPartBySheetName(workbookPart, "Sheet1");
There are a couple of other things I've noticed whilst looking at your code which you may (or may not!) be interested in:
In your code you are only reading the InnerXml so it might not matter to you but the way Excel stores strings is different to the way you are writing them so reading an Excel generated file may not give you the values you expect. In your example you are writing the string directly to the cell like this:
But Excel uses a SharedStrings concept where all strings are written to a separate XML file called sharedStrings.xml. That file contains the strings used in the Excel file with a reference and it's that value that is stored in the cell value in the sheet XML.
The sharedString.xml looks like this:
And the Cell then looks like this:
The 47 in the <v> element is a reference to the 47th shared string. Note that the type (the t attribute) in your generated XML is str but the type in the Excel generated file is s. This denotes yours is an inline string and theirs is a shared string.
You can read the SharedStrings just as you would any other part:
var stringTable = workbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
if (stringTable != null)
{
sharedString = stringTable.SharedStringTable.ElementAt(int.Parse(value)).InnerText;
}
Secondly, if you look at the cell reference that your code generates and the cell reference that Excel generates you can see you are only outputting the column and not the row (e.g. you output A instead of A1). To fix this you should change the line:
CellReference = Convert.ToString(Convert.ToChar(65 + colIndex)),
to
CellReference = Convert.ToString(Convert.ToChar(65 + colIndex) + rowIndex.ToString()),
I hope that helps.

I ran into a similar issue a while back trying to do this for Word documents (procedurally generated worked fine, but human-generated did not). I found this tool to be very helpful:
http://www.microsoft.com/en-us/download/details.aspx?id=30425
Basically, it looks at a file and shows you the code that Microsoft would generate to read it, as well as the xml structure of the file itself. As usual for Microsoft products, there are quite a few menus and it's not very intuitive, but after clicking around for a bit you will be able to see exactly what is going on with any two files. I would recommend you open a working excel file and a non-working one and compare the difference to see what's causing your issue.

Below is the OpenXML code that I use to read in a particular Worksheet from an Excel file, into a DataTable.
First, here's how you'd call it:
DataTable dt = OpenXMLHelper.ExcelWorksheetToDataTable("C:\\SQL Server\\SomeExcelFile.xlsx", "Mikes Worksheet");
And here's the code:
public class OpenXMLHelper
{
public static DataTable ExcelWorksheetToDataTable(string pathFilename, string worksheetName)
{
DataTable dt = new DataTable(worksheetName);
using (SpreadsheetDocument document = SpreadsheetDocument.Open(pathFilename, false))
{
// Find the sheet with the supplied name, and then use that
// Sheet object to retrieve a reference to the first worksheet.
Sheet theSheet = document.WorkbookPart.Workbook.Descendants<Sheet>().Where(s => s.Name == worksheetName).FirstOrDefault();
if (theSheet == null)
throw new Exception("Couldn't find the worksheet: " + worksheetName);
// Retrieve a reference to the worksheet part.
WorksheetPart wsPart = (WorksheetPart)(document.WorkbookPart.GetPartById(theSheet.Id));
Worksheet workSheet = wsPart.Worksheet;
string dimensions = workSheet.SheetDimension.Reference.InnerText; // Get the dimensions of this worksheet, eg "B2:F4"
int numOfColumns = 0;
int numOfRows = 0;
CalculateDataTableSize(dimensions, ref numOfColumns, ref numOfRows);
System.Diagnostics.Trace.WriteLine(string.Format("The worksheet \"{0}\" has dimensions \"{1}\", so we need a DataTable of size {2}x{3}.", worksheetName, dimensions, numOfColumns, numOfRows));
SheetData sheetData = workSheet.GetFirstChild<SheetData>();
IEnumerable<Row> rows = sheetData.Descendants<Row>();
string[,] cellValues = new string[numOfColumns, numOfRows];
int colInx = 0;
int rowInx = 0;
string value = "";
SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;
// Iterate through each row of OpenXML data
foreach (Row row in rows)
{
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
// *DON'T* assume there's going to be one XML element for each item in each row...
Cell cell = row.Descendants<Cell>().ElementAt(i);
if (cell.CellValue == null || cell.CellReference == null)
continue; // eg when an Excel cell contains a blank string
// Convert this Excel cell's CellAddress into a 0-based offset into our array (eg "G13" -> [6, 12])
colInx = GetColumnIndexByName(cell.CellReference); // eg "C" -> 2 (0-based)
rowInx = GetRowIndexFromCellAddress(cell.CellReference)-1; // Needs to be 0-based
// Fetch the value in this cell
value = cell.CellValue.InnerXml;
if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
{
value = stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
}
cellValues[colInx, rowInx] = value;
}
dt.Rows.Add(dataRow);
}
// Copy the array of strings into a DataTable
for (int col = 0; col < numOfColumns; col++)
dt.Columns.Add("Column_" + col.ToString());
for (int row = 0; row < numOfRows; row++)
{
DataRow dataRow = dt.NewRow();
for (int col = 0; col < numOfColumns; col++)
{
dataRow.SetField(col, cellValues[col, row]);
}
dt.Rows.Add(dataRow);
}
#if DEBUG
// Write out the contents of our DataTable to the Output window (for debugging)
string str = "";
for (rowInx = 0; rowInx < maxNumOfRows; rowInx++)
{
for (colInx = 0; colInx < maxNumOfColumns; colInx++)
{
object val = dt.Rows[rowInx].ItemArray[colInx];
str += (val == null) ? "" : val.ToString();
str += "\t";
}
str += "\n";
}
System.Diagnostics.Trace.WriteLine(str);
#endif
return dt;
}
}
private static void CalculateDataTableSize(string dimensions, ref int numOfColumns, ref int numOfRows)
{
// How many columns & rows of data does this Worksheet contain ?
// We'll read in the Dimensions string from the Excel file, and calculate the size based on that.
// eg "B1:F4" -> we'll need 6 columns and 4 rows.
//
// (We deliberately ignore the top-left cell address, and just use the bottom-right cell address.)
try
{
string[] parts = dimensions.Split(':'); // eg "B1:F4"
if (parts.Length != 2)
throw new Exception("Couldn't find exactly *two* CellAddresses in the dimension");
numOfColumns = 1 + GetColumnIndexByName(parts[1]); // A=1, B=2, C=3 (1-based value), so F4 would return 6 columns
numOfRows = GetRowIndexFromCellAddress(parts[1]);
}
catch
{
throw new Exception("Could not calculate maximum DataTable size from the worksheet dimension: " + dimensions);
}
}
public static int GetRowIndexFromCellAddress(string cellAddress)
{
// Convert an Excel CellReference column into a 1-based row index
// eg "D42" -> 42
// "F123" -> 123
string rowNumber = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^0-9 _]", "");
return int.Parse(rowNumber);
}
public static int GetColumnIndexByName(string cellAddress)
{
// Convert an Excel CellReference column into a 0-based column index
// eg "D42" -> 3
// "F123" -> 5
var columnName = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^A-Z_]", "");
int number = 0, pow = 1;
for (int i = columnName.Length - 1; i >= 0; i--)
{
number += (columnName[i] - 'A' + 1) * pow;
pow *= 26;
}
return number - 1;
}
}
Just to mention, some of our company's Excel Worksheets have one or more blank rows at the top. Strangely, this prevented some other OpenXML libraries from reading in such Worksheets properly.
This code deliberately creates a DataTable with one value for each of the cells in the Worksheet, even the blank ones at the top.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to read faster in OpenXML format - c#

Related

Reading excel file in c# using Microsoft DocumentFormat.OpenXml SDK

How to use Epplus with cells containing few rows

Update all objects where create new one

C# OPEN XML: empty cells are getting skipped while getting data from EXCEL to DATATABLE

Having trouble reading excel file with the OpenXML sdk

Categories

Resources