Reading an Excel file in C# using the Microsoft DocumentFormat.OpenXml SDK

I am using the Microsoft DocumentFormat.OpenXml SDK to read data from an Excel file.
While doing so, I also want to handle cells with blank values (if a cell is blank, it should still be read).
I am now facing an issue with one Excel sheet where workSheet.SheetDimension is null, so the code throws an exception.
Code used:
class OpenXMLHelper
{
// A helper function to open an Excel file using OpenXML, and return a DataTable containing all the data from one
// of the worksheets.
//
// We've had lots of problems reading in Excel data using OLEDB (eg the ACE drivers no longer being present on new servers,
// OLEDB not working due to security issues, and blatantly ignoring blank rows at the top of worksheets), so this is a more
// stable method of reading in the data.
//
public static DataTable ExcelWorksheetToDataTable(string pathFilename)
{
try
{
DataTable dt = new DataTable();
string dimensions = string.Empty;
using (FileStream fs = new FileStream(pathFilename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
using (SpreadsheetDocument document = SpreadsheetDocument.Open(fs, false))
{
// Find the sheet with the supplied name, and then use that
// Sheet object to retrieve a reference to the first worksheet.
//Sheet theSheet = document.WorkbookPart.Workbook.Descendants<Sheet>().Where(s => s.Name == worksheetName).FirstOrDefault();
//--Sheet theSheet = document.WorkbookPart.Workbook.Descendants<Sheet>().FirstOrDefault();
//--if (theSheet == null)
//-- throw new Exception("Couldn't find the worksheet: "+ theSheet.Id);
// Retrieve a reference to the worksheet part.
//WorksheetPart wsPart = (WorksheetPart)(document.WorkbookPart.GetPartById(theSheet.Id));
//--WorksheetPart wsPart = (WorksheetPart)(document.WorkbookPart.GetPartById(theSheet.Id));
WorkbookPart workbookPart = document.WorkbookPart;
WorksheetPart wsPart = workbookPart.WorksheetParts.FirstOrDefault();
Worksheet workSheet = wsPart.Worksheet;
dimensions = workSheet.SheetDimension.Reference.InnerText; // Get the dimensions of this worksheet, eg "B2:F4"
int numOfColumns = 0;
int numOfRows = 0;
CalculateDataTableSize(dimensions, ref numOfColumns, ref numOfRows);
//System.Diagnostics.Trace.WriteLine(string.Format("The worksheet \"{0}\" has dimensions \"{1}\", so we need a DataTable of size {2}x{3}.", worksheetName, dimensions, numOfColumns, numOfRows));
SheetData sheetData = workSheet.GetFirstChild<SheetData>();
IEnumerable<Row> rows = sheetData.Descendants<Row>();
string[,] cellValues = new string[numOfColumns, numOfRows];
int colInx = 0;
int rowInx = 0;
string value = "";
SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;
// Iterate through each row of OpenXML data, and store each cell's value in the appropriate slot in our [,] string array.
foreach (Row row in rows)
{
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
// *DON'T* assume there's going to be one XML element for each column in each row...
Cell cell = row.Descendants<Cell>().ElementAt(i);
if (cell.CellValue == null || cell.CellReference == null)
continue; // eg when an Excel cell contains a blank string
// Convert this Excel cell's CellAddress into a 0-based offset into our array (eg "G13" -> [6, 12])
colInx = GetColumnIndexByName(cell.CellReference); // eg "C" -> 2 (0-based)
rowInx = GetRowIndexFromCellAddress(cell.CellReference) - 1; // Needs to be 0-based
// Fetch the value in this cell
value = cell.CellValue.InnerXml;
if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
{
value = stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
}
cellValues[colInx, rowInx] = value;
}
}
// Copy the array of strings into a DataTable.
// We don't (currently) make any attempt to work out which columns should be numeric, rather than string.
for (int col = 0; col < numOfColumns; col++)
{
//dt.Columns.Add("Column_" + col.ToString());
dt.Columns.Add(cellValues[col, 0]);
}
//foreach (Cell cell in rows.ElementAt(0))
//{
// dt.Columns.Add(GetCellValue(doc, cell));
//}
for (int row = 0; row < numOfRows; row++)
{
DataRow dataRow = dt.NewRow();
for (int col = 0; col < numOfColumns; col++)
{
dataRow.SetField(col, cellValues[col, row]);
}
dt.Rows.Add(dataRow);
}
dt.Rows.RemoveAt(0);
//#if DEBUG
// // Write out the contents of our DataTable to the Output window (for debugging)
// string str = "";
// for (rowInx = 0; rowInx < maxNumOfRows; rowInx++)
// {
// for (colInx = 0; colInx < maxNumOfColumns; colInx++)
// {
// object val = dt.Rows[rowInx].ItemArray[colInx];
// str += (val == null) ? "" : val.ToString();
// str += "\t";
// }
// str += "\n";
// }
// System.Diagnostics.Trace.WriteLine(str);
//#endif
return dt;
}
}
}
catch (Exception ex)
{
return null;
}
}
public static void CalculateDataTableSize(string dimensions, ref int numOfColumns, ref int numOfRows)
{
// How many columns & rows of data does this Worksheet contain ?
// We'll read in the Dimensions string from the Excel file, and calculate the size based on that.
// eg "B1:F4" -> we'll need 6 columns and 4 rows.
//
// (We deliberately ignore the top-left cell address, and just use the bottom-right cell address.)
try
{
string[] parts = dimensions.Split(':'); // eg "B1:F4"
if (parts.Length != 2)
throw new Exception("Couldn't find exactly *two* CellAddresses in the dimension");
numOfColumns = 1 + GetColumnIndexByName(parts[1]); // A=1, B=2, C=3 (1-based value), so F4 would return 6 columns
numOfRows = GetRowIndexFromCellAddress(parts[1]);
}
catch
{
throw new Exception("Could not calculate maximum DataTable size from the worksheet dimension: " + dimensions);
}
}
public static int GetRowIndexFromCellAddress(string cellAddress)
{
// Convert an Excel CellReference column into a 1-based row index
// eg "D42" -> 42
// "F123" -> 123
string rowNumber = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^0-9 _]", "");
return int.Parse(rowNumber);
}
public static int GetColumnIndexByName(string cellAddress)
{
// Convert an Excel CellReference column into a 0-based column index
// eg "D42" -> 3
// "F123" -> 5
var columnName = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^A-Z_]", "");
int number = 0, pow = 1;
for (int i = columnName.Length - 1; i >= 0; i--)
{
number += (columnName[i] - 'A' + 1) * pow;
pow *= 26;
}
return number - 1;
}
}

The SheetDimension element is optional (and, even when present, you cannot always rely on it being up to date). See the following part of the OpenXML specification:
18.3.1.35 dimension (Worksheet Dimensions)
This element specifies the used range of the worksheet. It specifies the row and column bounds of
used cells in the worksheet. This is optional and is not required.
Used cells include cells with formulas, text content, and cell
formatting. When an entire column is formatted, only the first cell in
that column is considered used.
So an Excel file without a SheetDimension element is perfectly valid, and you should not rely on it being present.
Therefore I'd suggest simply parsing all Row elements contained in the SheetData element and counting the rows and columns yourself (instead of reading the SheetDimension element). This way you can also take into account that an Excel file may contain completely blank rows in between the data.
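The counting suggested above can be sketched without touching the SDK at all. This is only a minimal illustration under stated assumptions: the class name UsedRange and its methods are hypothetical helpers (not part of the OpenXML SDK), and the input is assumed to be plain cell reference strings such as "B2".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the OpenXML SDK): derives the table size
// from the cell references themselves instead of the optional dimension element.
public static class UsedRange
{
    // Splits a reference such as "AB12" into a 0-based column index and a 1-based row number.
    public static (int Column, int Row) Parse(string cellReference)
    {
        int column = 0, row = 0;
        foreach (char ch in cellReference.ToUpperInvariant())
        {
            if (ch >= 'A' && ch <= 'Z')
                column = column * 26 + (ch - 'A' + 1);   // base-26 digits with A = 1
            else if (ch >= '0' && ch <= '9')
                row = row * 10 + (ch - '0');
            // any other character (for example '$') is ignored
        }
        return (column - 1, row);
    }

    // Returns how many columns and rows are needed to hold every referenced cell.
    // An empty sequence yields (0, 0).
    public static (int Columns, int Rows) Measure(IEnumerable<string> cellReferences)
    {
        int columns = 0, rows = 0;
        foreach (string reference in cellReferences)
        {
            if (string.IsNullOrEmpty(reference)) continue;   // cells without a reference are skipped
            var (column, row) = Parse(reference);
            columns = Math.Max(columns, column + 1);
            rows = Math.Max(rows, row);
        }
        return (columns, rows);
    }
}
```

With the SDK, the references could presumably be supplied as sheetData.Descendants<Cell>().Select(c => c.CellReference?.Value), and the result used in place of the values that CalculateDataTableSize derives from SheetDimension.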

Related

c#: Is there a way to retrieve the cell address in excel from where data begins?

I'm trying to copy Excel data from one sheet to another. It's working fine, but the problem is this: in the source file, the data doesn't always start at cell A1 (consider the image below); in this case I want to copy data starting from cell B5. The "Some Header" part is not required; the actual data starts at the Emp ID cell.
What I've tried: I can provide a textbox to enter the cell address and then start copying the data from that address. But this requires manual intervention, and I want it automated. Any help on this is appreciated. Thanks for the help.
Assuming some basic criteria, the following code should do it. The criteria I assume are: 1) if a row contains any merged cells (like your "Some Header"), then that isn't the start row; 2) the start cell will have text in the cell to its right and in the cell below it.
private static bool RowIsEmpty(Range range)
{
foreach (object obj in (object[,])range.Value2)
{
if (obj != null && obj.ToString() != "")
{
return false;
}
}
return true;
}
private static bool CellIsEmpty(Range cell)
{
if (cell.Value2 != null && cell.Value2.ToString() != "")
{
return false;
}
return true;
}
private Tuple<int, int> ExcelFindStartCell()
{
var excelApp = new Microsoft.Office.Interop.Excel.Application();
excelApp.Visible = true;
Workbook workbook = excelApp.Workbooks.Open("test.xlsx");
Worksheet worksheet = excelApp.ActiveSheet;
// Go through each row.
for (int row = 1; row < worksheet.Rows.Count; row++)
{
Range range = worksheet.Rows[row];
// Check if the row is empty.
if (RowIsEmpty(range))
{
continue;
}
// Check if the row contains any merged cells, if so we'll assume it's
// some kind of header and move on.
object mergedCells = range.MergeCells;
if (mergedCells == DBNull.Value || (bool)mergedCells)
{
continue;
}
// Find the first column that contains text in this row.
for (int col = 1; col < range.Columns.Count; col++)
{
Range cell = range.Cells[1, col];
if (CellIsEmpty(cell))
{
continue;
}
// Now check if the cell to the right also contains text.
Range rightCell = worksheet.Cells[row, col + 1];
if (CellIsEmpty(rightCell))
{
// No text in right cell, try the next row.
break;
}
// Now check if cell below also contains text.
Range bottomCell = worksheet.Cells[row + 1, col];
if (CellIsEmpty(bottomCell))
{
// No text in bottom cell, try the next row.
break;
}
// Success!
workbook.Close();
excelApp.Quit();
return new Tuple<int, int>(row, col);
}
}
// Didn't find anything that matched the criteria.
workbook.Close();
excelApp.Quit();
return null;
}

C# OPEN XML: empty cells are getting skipped while getting data from EXCEL to DATATABLE

Task
Import data from excel to DataTable
Problem
Cells that do not contain any data are skipped, and the next cell in the row that has data is used as the value of the empty column.
E.g.
If A1 is empty and A2 has the value Tom, then after importing the data A1 gets the value of A2 and A2 remains empty.
To make it very clear I am providing some screen shots below
This is the excel data
This is the DataTable after importing the data from excel
Code
public class ImportExcelOpenXml
{
public static DataTable Fill_dataTable(string fileName)
{
DataTable dt = new DataTable();
using (SpreadsheetDocument spreadSheetDocument = SpreadsheetDocument.Open(fileName, false))
{
WorkbookPart workbookPart = spreadSheetDocument.WorkbookPart;
IEnumerable<Sheet> sheets = spreadSheetDocument.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>();
string relationshipId = sheets.First().Id.Value;
WorksheetPart worksheetPart = (WorksheetPart)spreadSheetDocument.WorkbookPart.GetPartById(relationshipId);
Worksheet workSheet = worksheetPart.Worksheet;
SheetData sheetData = workSheet.GetFirstChild<SheetData>();
IEnumerable<Row> rows = sheetData.Descendants<Row>();
foreach (Cell cell in rows.ElementAt(0))
{
dt.Columns.Add(GetCellValue(spreadSheetDocument, cell));
}
foreach (Row row in rows) //this will also include your header row...
{
DataRow tempRow = dt.NewRow();
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
tempRow[i] = GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
}
dt.Rows.Add(tempRow);
}
}
dt.Rows.RemoveAt(0); //...so i'm taking it out here.
return dt;
}
public static string GetCellValue(SpreadsheetDocument document, Cell cell)
{
SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;
string value = cell.CellValue.InnerXml;
if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
{
return stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
}
else
{
return value;
}
}
}
My Thoughts
I think there is some problem with
public IEnumerable<T> Descendants<T>() where T : OpenXmlElement;
For example, if I want the column count using Descendants:
IEnumerable<Row> rows = sheetData.Descendants<Row>();
int colCnt = rows.ElementAt(0).Count();
OR
if I want the row count using Descendants:
IEnumerable<Row> rows = sheetData.Descendants<Row>();
int rowCnt = rows.Count();
In both cases Descendants skips the empty cells.
Is there any alternative to Descendants?
Your suggestions are highly appreciated
P.S: I have also thought of getting the cells values by using column names like A1, A2 but in order to do that I will have to get the exact count of columns and rows which is not possible by using Descendants function.
Had there been some data in all the cells of a row then everything works fine. But if you happen to have even single empty cell in a row then things go haywire.
Why it is happening in first place?
The reason lies in below line of code:
row.Descendants<Cell>().Count()
Count() gives you the number of Cell elements actually stored for the row; empty cells are usually not stored in the file at all, so they are not counted. So, when you pass row.Descendants<Cell>().ElementAt(i) as the argument to the GetCellValue method like this:
GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
Then, it will find the content of the next non-empty cell, not necessarily the content of the cell at column index i e.g. if the first column is empty and we call ElementAt(1), it returns the value in the second column instead and our program logic gets messed up.
Solution: We need to deal with the occurrence of empty cells in the row i.e. we need to figure out the actual/effective column index of the target cell in case there were some empty cells before it in the given row. So, you need to substitute your for loop code below:
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
tempRow[i] = GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
}
with
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
Cell cell = row.Descendants<Cell>().ElementAt(i);
int actualCellIndex = CellReferenceToIndex(cell);
tempRow[actualCellIndex] = GetCellValue(spreadSheetDocument, cell);
}
Also, add below method in your code which is used in the above modified code snippet to obtain the actual/effective column index of any cell:
private static int CellReferenceToIndex(Cell cell)
{
// Accumulate a 1-based column number (A = 1 ... Z = 26, AA = 27) and convert it to 0-based at the end,
// so that multi-letter columns such as AA to AZ are also handled correctly.
int column = 0;
string reference = cell.CellReference.ToString().ToUpper();
foreach (char ch in reference)
{
if (!Char.IsLetter(ch))
{
break; // reached the row number
}
column = (column * 26) + (ch - 'A' + 1);
}
return column - 1;
}
Note: Excel numbers rows and columns starting from 1, whereas the index returned by this method is 0-based, as DataRow expects.
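The alignment idea in this answer can also be sketched independently of the SDK. The following is a minimal illustration, not the answer's own code: SparseRow, ColumnIndex and ToDense are hypothetical names, and each cell is assumed to arrive as a (reference, value) pair.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: expands the cells that are actually stored for a row
// into a fixed-width array, leaving "" where the file has no cell.
public static class SparseRow
{
    // Converts the column letters of a reference such as "C7" into a 0-based index
    // (A = 0, Z = 25, AA = 26).
    public static int ColumnIndex(string cellReference)
    {
        int column = 0;
        foreach (char ch in cellReference.ToUpperInvariant())
        {
            if (ch < 'A' || ch > 'Z') break;          // stop at the first non-letter
            column = column * 26 + (ch - 'A' + 1);
        }
        return column - 1;
    }

    // Places each (reference, value) pair at its real column index.
    public static string[] ToDense(IEnumerable<KeyValuePair<string, string>> cells, int width)
    {
        var dense = new string[width];
        for (int i = 0; i < width; i++) dense[i] = "";
        foreach (var cell in cells)
        {
            int index = ColumnIndex(cell.Key);
            if (index >= 0 && index < width)          // ignore cells outside the table
                dense[index] = cell.Value;
        }
        return dense;
    }
}
```

For a row that only stores B2 = "Tom" and D2 = "42", ToDense with a width of 4 gives an array whose first and third entries are empty strings, so the values stay in their own columns.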
public void Read2007Xlsx()
{
try
{
DataTable dt = new DataTable();
using (SpreadsheetDocument spreadSheetDocument = SpreadsheetDocument.Open(@"D:\File.xlsx", false))
{
WorkbookPart workbookPart = spreadSheetDocument.WorkbookPart;
IEnumerable<Sheet> sheets = spreadSheetDocument.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>();
string relationshipId = sheets.First().Id.Value;
WorksheetPart worksheetPart = (WorksheetPart)spreadSheetDocument.WorkbookPart.GetPartById(relationshipId);
Worksheet workSheet = worksheetPart.Worksheet;
SheetData sheetData = workSheet.GetFirstChild<SheetData>();
IEnumerable<Row> rows = sheetData.Descendants<Row>();
foreach (Cell cell in rows.ElementAt(0))
{
dt.Columns.Add(GetCellValue(spreadSheetDocument, cell));
}
foreach (Row row in rows) //this will also include your header row...
{
DataRow tempRow = dt.NewRow();
int columnIndex = 0;
foreach (Cell cell in row.Descendants<Cell>())
{
// Gets the column index of the cell with data
int cellColumnIndex = (int)GetColumnIndexFromName(GetColumnName(cell.CellReference));
cellColumnIndex--; //zero based index
if (columnIndex < cellColumnIndex)
{
do
{
tempRow[columnIndex] = ""; //Insert blank data here;
columnIndex++;
}
while (columnIndex < cellColumnIndex);
}//end if block
tempRow[columnIndex] = GetCellValue(spreadSheetDocument, cell);
columnIndex++;
}//end inner foreach loop
dt.Rows.Add(tempRow);
}//end outer foreach loop
}//end using block
dt.Rows.RemoveAt(0); //...so i'm taking it out here.
}//end try
catch (Exception ex)
{
}
}//end Read2007Xlsx method
/// <summary>
/// Given a cell name, parses the specified cell to get the column name.
/// </summary>
/// <param name="cellReference">Address of the cell (ie. B2)</param>
/// <returns>Column Name (ie. B)</returns>
public static string GetColumnName(string cellReference)
{
// Create a regular expression to match the column name portion of the cell name.
Regex regex = new Regex("[A-Za-z]+");
Match match = regex.Match(cellReference);
return match.Value;
} //end GetColumnName method
/// <summary>
/// Given just the column name (no row index), it will return the one-based column index (A = 1, Z = 26, AA = 27).
/// Note: The name must be in upper case; names of any length are handled.
/// </summary>
/// <param name="columnName">Column Name (ie. A or AB)</param>
/// <returns>One-based column index; the caller subtracts 1 to get a zero-based index</returns>
public static int? GetColumnIndexFromName(string columnName)
{
//return columnIndex;
string name = columnName;
int number = 0;
int pow = 1;
for (int i = name.Length - 1; i >= 0; i--)
{
number += (name[i] - 'A' + 1) * pow;
pow *= 26;
}
return number;
} //end GetColumnIndexFromName method
public static string GetCellValue(SpreadsheetDocument document, Cell cell)
{
SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;
if (cell.CellValue ==null)
{
return "";
}
string value = cell.CellValue.InnerXml;
if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
{
return stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
}
else
{
return value;
}
}//end GetCellValue method
foreach (Cell cell in row.Descendants<Cell>())
{
while (columnRef[i] + (dt.Rows.Count + 1) != cell.CellReference)
{
dt.Rows[dt.Rows.Count - 1][i] = "";
i += 1;
}
dt.Rows[dt.Rows.Count - 1][i] = GetValue(doc, cell);
i++;
}
Try this code. I have made a few modifications and it worked for me:
public static DataTable Fill_dataTable(string filePath)
{
DataTable dt = new DataTable();
using (SpreadsheetDocument doc = SpreadsheetDocument.Open(filePath, false))
{
Sheet sheet = doc.WorkbookPart.Workbook.Sheets.GetFirstChild<Sheet>();
Worksheet worksheet = ((WorksheetPart)doc.WorkbookPart.GetPartById(sheet.Id.Value)).Worksheet;
IEnumerable<Row> rows = worksheet.GetFirstChild<SheetData>().Descendants<Row>();
// Column letters of the header cells, eg "A", "B", "C".
List<string> columnRef = new List<string>();
foreach (Row row in rows)
{
if (row.RowIndex != null)
{
if (row.RowIndex.Value == 1)
{
foreach (Cell cell in row.Descendants<Cell>())
{
// GetValue is this answer's helper (not shown); it is assumed to behave like GetCellValue in the question.
dt.Columns.Add(GetValue(doc, cell));
// Header cells are in row 1, so dropping the last character leaves the column letters (eg "B1" -> "B").
columnRef.Add(cell.CellReference.ToString().Substring(0, cell.CellReference.ToString().Length - 1));
}
}
else
{
dt.Rows.Add();
int i = 0;
foreach (Cell cell in row.Descendants<Cell>())
{
// Fill skipped columns with blanks until we reach this cell's column.
while (columnRef[i] + (dt.Rows.Count + 1) != cell.CellReference.Value)
{
dt.Rows[dt.Rows.Count - 1][i] = "";
i += 1;
}
dt.Rows[dt.Rows.Count - 1][i] = GetValue(doc, cell);
i += 1;
}
}
}
}
}
return dt;
}

OpenXML how to get cell in range

Please help me get the cells in a range (e.g. all cells in the rectangle from A1 to E11).
For now, my idea is:
Worksheet worksheet = GetWorksheet(document, sheetName);
SheetData sheetData = worksheet.GetFirstChild<SheetData>();
IEnumerable<Cell> cells = sheetData.Descendants<Cell>().Where(c =>
c.CellReference >= A:1 &&
c.CellReference <= E:11 &&
);
int t = cells.Count();
But this code does not work.
Thanks
Comparing a cell's CellReference with a string is not that easy. And yes, what you are currently doing is wrong: you cannot compare the reference strings for higher or lower in this way (for example, "B10" sorts before "B2" as a string).
You have two options.
Option 1:
You can take the cell reference and break it down: separate the letters from the digits, convert each part to a number, and compare the numbers.
A1 -> A and 1 -> A = 1, so you have 1 and 1
E11 -> E and 11 -> E = 5, so you have 5 and 11
So you will need to break down the CellReference and check it against your range.
Option 2:
As you can see above, the reference is simply a 2D matrix index (e.g. 1,1 and 5,11 in COLUMN,ROW format), and you can use this in the comparison. The catch is that you cannot use LINQ for this; you need to iterate through the rows and columns. Here is some example code to try:
using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open("PATH", true))
{
//Get workbookpart
WorkbookPart workbookPart = myDoc.WorkbookPart;
// Extract the workbook part
var stringtable = workbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
//then access to the worksheet part
IEnumerable<WorksheetPart> worksheetPart = workbookPart.WorksheetParts;
foreach (WorksheetPart WSP in worksheetPart)
{
//find sheet data
IEnumerable<SheetData> sheetData = WSP.Worksheet.Elements<SheetData>();
int RowCount = 0;
int CellCount = 0;
// This is A1
int RowMin = 1;
int ColMin = 1;
//This is E11
int RowMax = 11;
int ColMax = 5;
foreach (SheetData SD in sheetData)
{
foreach (Row row in SD.Elements<Row>())
{
RowCount++; // We are in a new row
// For each cell we need to identify type
foreach (Cell cell in row.Elements<Cell>())
{
// We are in a new Cell
CellCount++;
if ((RowCount >= RowMin && CellCount >= ColMin) && (RowCount <= RowMax && CellCount <= ColMax))
{
if (cell.DataType == null && cell.CellValue != null)
{
// Check for pure numbers
Console.WriteLine(cell.CellValue.Text);
}
else if (cell.DataType != null && cell.DataType.Value == CellValues.Boolean)
{
// Booleans
Console.WriteLine(cell.CellValue.Text);
}
else if (cell.CellValue != null)
{
// A shared string
if (stringtable != null)
{
// Cell value holds the shared string location
Console.WriteLine(stringtable.SharedStringTable.ElementAt(int.Parse(cell.CellValue.Text)).InnerText);
}
}
else
{
Console.WriteLine("A broken book");
}
}
}
// Reset Cell count
CellCount = 0;
}
}
}
}
This actually works; I tested it.
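Option 1 from this answer (splitting the reference into a column number and a row number and comparing those) can be sketched as a small predicate. This is an illustration only: CellRange, TryParse and Contains are hypothetical names, and references are assumed to be in plain "A1" form without "$" or sheet prefixes.

```csharp
using System;

// Hypothetical helper: numeric range test for cell references such as "C5".
public static class CellRange
{
    // Parses "E11" into column 5 and row 11 (both 1-based); returns false for malformed input.
    public static bool TryParse(string cellReference, out int column, out int row)
    {
        column = 0;
        row = 0;
        if (string.IsNullOrEmpty(cellReference)) return false;
        string text = cellReference.ToUpperInvariant();
        int i = 0;
        while (i < text.Length && text[i] >= 'A' && text[i] <= 'Z')
        {
            column = column * 26 + (text[i] - 'A' + 1);   // A = 1, Z = 26, AA = 27
            i++;
        }
        // There must be at least one letter, followed by a positive row number.
        return column > 0 && i < text.Length
            && int.TryParse(text.Substring(i), out row) && row > 0;
    }

    // True when cellReference lies inside the rectangle topLeft..bottomRight (inclusive).
    public static bool Contains(string topLeft, string bottomRight, string cellReference)
    {
        if (!TryParse(topLeft, out int c1, out int r1) ||
            !TryParse(bottomRight, out int c2, out int r2) ||
            !TryParse(cellReference, out int c, out int r))
            return false;
        return c >= c1 && c <= c2 && r >= r1 && r <= r2;
    }
}
```

Because the comparison uses each cell's own reference rather than a running counter, it does not depend on every cell of a row being present in the file. It seems such a predicate could also be passed to Where, for example sheetData.Descendants<Cell>().Where(c => CellRange.Contains("A1", "E11", c.CellReference?.Value)).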

Having trouble reading excel file with the OpenXML sdk

I have a function that reads from an excel file and stores the results in a DataSet. I have another function that writes to an excel file. When I try to read from a regular human-generated excel file, the excel reading function returns a blank DataSet, but when I read from the excel file generated by the writing function, it works perfectly fine. The function then will not work on a regular generated excel file, even when I just copy and paste the contents of the function generated excel file. I finally tracked it down to this, but I have no idea where to go from here. Is there something wrong with my code?
Here is the excel generating function:
public static Boolean writeToExcel(string fileName, DataSet data)
{
Boolean answer = false;
using (SpreadsheetDocument excelDoc = SpreadsheetDocument.Create(tempPath + fileName, SpreadsheetDocumentType.Workbook))
{
WorkbookPart workbookPart = excelDoc.AddWorkbookPart();
workbookPart.Workbook = new Workbook();
WorksheetPart worksheetPart = workbookPart.AddNewPart<WorksheetPart>();
Sheets sheets = excelDoc.WorkbookPart.Workbook.AppendChild<Sheets>(new Sheets());
Sheet sheet = new Sheet()
{
Id = excelDoc.WorkbookPart.GetIdOfPart(worksheetPart),
SheetId = 1,
Name = "Page1"
};
sheets.Append(sheet);
CreateWorkSheet(worksheetPart, data);
answer = true;
}
return answer;
}
private static void CreateWorkSheet(WorksheetPart worksheetPart, DataSet data)
{
Worksheet worksheet = new Worksheet();
SheetData sheetData = new SheetData();
UInt32Value currRowIndex = 1U;
int colIndex = 0;
Row excelRow;
DataTable table = data.Tables[0];
for (int rowIndex = -1; rowIndex < table.Rows.Count; rowIndex++)
{
excelRow = new Row();
excelRow.RowIndex = currRowIndex++;
for (colIndex = 0; colIndex < table.Columns.Count; colIndex++)
{
Cell cell = new Cell()
{
CellReference = Convert.ToString(Convert.ToChar(65 + colIndex)),
DataType = CellValues.String
};
CellValue cellValue = new CellValue();
if (rowIndex == -1)
{
cellValue.Text = table.Columns[colIndex].ColumnName.ToString();
}
else
{
cellValue.Text = (table.Rows[rowIndex].ItemArray[colIndex].ToString() != "") ? table.Rows[rowIndex].ItemArray[colIndex].ToString() : "*";
}
cell.Append(cellValue);
excelRow.Append(cell);
}
sheetData.Append(excelRow);
}
SheetFormatProperties formattingProps = new SheetFormatProperties()
{
DefaultColumnWidth = 20D,
DefaultRowHeight = 20D
};
worksheet.Append(formattingProps);
worksheet.Append(sheetData);
worksheetPart.Worksheet = worksheet;
}
The reading function is as follows:
public static void readInventoryExcel(string fileName, ref DataSet set)
{
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(fileName, false))
{
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
int count = -1;
foreach (Row r in sheetData.Elements<Row>())
{
if (count >= 0)
{
DataRow row = set.Tables[0].NewRow();
row["SerialNumber"] = r.ChildElements[1].InnerXml;
row["PartNumber"] = r.ChildElements[2].InnerXml;
row["EntryDate"] = r.ChildElements[3].InnerXml;
row["RetirementDate"] = r.ChildElements[4].InnerXml;
row["ReasonForReplacement"] = r.ChildElements[5].InnerXml;
row["RetirementTech"] = r.ChildElements[6].InnerXml;
row["IncludeInMaintenance"] = r.ChildElements[7].InnerXml;
row["MaintenanceTech"] = r.ChildElements[8].InnerXml;
row["Comment"] = r.ChildElements[9].InnerXml;
row["Station"] = r.ChildElements[10].InnerXml;
row["LocationStatus"] = r.ChildElements[11].InnerXml;
row["AssetName"] = r.ChildElements[12].InnerXml;
row["InventoryType"] = r.ChildElements[13].InnerXml;
row["Description"] = r.ChildElements[14].InnerXml;
set.Tables[0].Rows.Add(row);
}
count++;
}
}
}
I think this is caused by the fact that your file has only one sheet whereas a workbook created by Excel has three. I'm not certain, but I think the sheets are returned in reverse order, so you should change the line:
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
to
WorksheetPart worksheetPart = workbookPart.WorksheetParts.Last();
It might be safer to search for the WorksheetPart by sheet name, if you can identify it that way. You need to find the Sheet first, then use its Id to find the WorksheetPart:
private WorksheetPart GetWorksheetPartBySheetName(WorkbookPart workbookPart, string sheetName)
{
//find the sheet first.
IEnumerable<Sheet> sheets = workbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>().Where(s => s.Name == sheetName);
if (sheets.Count() > 0)
{
string relationshipId = sheets.First().Id.Value;
WorksheetPart worksheetPart = (WorksheetPart)workbookPart.GetPartById(relationshipId);
return worksheetPart;
}
return null;
}
You can then use:
WorksheetPart worksheetPart = GetWorksheetPartBySheetName(workbookPart, "Sheet1");
There are a couple of other things I've noticed whilst looking at your code which you may (or may not!) be interested in:
In your code you are only reading the InnerXml so it might not matter to you but the way Excel stores strings is different to the way you are writing them so reading an Excel generated file may not give you the values you expect. In your example you are writing the string directly to the cell like this:
But Excel uses a SharedStrings concept where all strings are written to a separate XML file called sharedStrings.xml. That file contains the strings used in the Excel file with a reference and it's that value that is stored in the cell value in the sheet XML.
The sharedString.xml looks like this:
And the Cell then looks like this:
The 47 in the <v> element is a zero-based index into the shared string table. Note that the type (the t attribute) in your generated XML is str but the type in the Excel-generated file is s. This means your cell stores its text directly in the cell, whereas Excel's cell refers to a shared string.
You can read the SharedStrings just as you would any other part:
var stringTable = workbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
if (stringTable != null)
{
sharedString = stringTable.SharedStringTable.ElementAt(int.Parse(value)).InnerText;
}
Secondly, if you look at the cell reference that your code generates and the cell reference that Excel generates you can see you are only outputting the column and not the row (e.g. you output A instead of A1). To fix this you should change the line:
CellReference = Convert.ToString(Convert.ToChar(65 + colIndex)),
to
CellReference = Convert.ToChar(65 + colIndex).ToString() + excelRow.RowIndex.Value, // eg "A1"; rowIndex itself starts at -1, so use the row's own index (columns A to Z only)
I hope that helps.
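Building a full cell reference (column letters plus row number) can be sketched as below. This is a minimal illustration and not code from the answer: CellAddress, ColumnName and Create are hypothetical names, and, unlike the single-character approach above, the sketch is meant to work past column Z as well.

```csharp
using System;

// Hypothetical helper: builds references such as "A1" or "AA10" for new cells.
public static class CellAddress
{
    // Converts a 0-based column index into Excel column letters
    // (0 -> "A", 25 -> "Z", 26 -> "AA").
    public static string ColumnName(int columnIndex)
    {
        if (columnIndex < 0) throw new ArgumentOutOfRangeException(nameof(columnIndex));
        string name = "";
        int n = columnIndex + 1;                   // work with a 1-based column number
        while (n > 0)
        {
            int remainder = (n - 1) % 26;          // 0..25 maps to 'A'..'Z'
            name = (char)('A' + remainder) + name; // prepend the next letter
            n = (n - 1) / 26;
        }
        return name;
    }

    // Combines the column letters with a 1-based row number.
    public static string Create(int columnIndex, uint rowNumber)
    {
        if (rowNumber == 0) throw new ArgumentOutOfRangeException(nameof(rowNumber));
        return ColumnName(columnIndex) + rowNumber;
    }
}
```

In CreateWorkSheet, something like CellReference = CellAddress.Create(colIndex, excelRow.RowIndex.Value) would then give each cell a reference that includes its row.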
I ran into a similar issue a while back trying to do this for Word documents (procedurally generated worked fine, but human-generated did not). I found this tool to be very helpful:
http://www.microsoft.com/en-us/download/details.aspx?id=30425
Basically, it looks at a file and shows you the code that Microsoft would generate to read it, as well as the xml structure of the file itself. As usual for Microsoft products, there are quite a few menus and it's not very intuitive, but after clicking around for a bit you will be able to see exactly what is going on with any two files. I would recommend you open a working excel file and a non-working one and compare the difference to see what's causing your issue.
Below is the OpenXML code that I use to read in a particular Worksheet from an Excel file, into a DataTable.
First, here's how you'd call it:
DataTable dt = OpenXMLHelper.ExcelWorksheetToDataTable("C:\\SQL Server\\SomeExcelFile.xlsx", "Mikes Worksheet");
And here's the code:
public class OpenXMLHelper
{
public static DataTable ExcelWorksheetToDataTable(string pathFilename, string worksheetName)
{
DataTable dt = new DataTable(worksheetName);
using (SpreadsheetDocument document = SpreadsheetDocument.Open(pathFilename, false))
{
// Find the sheet with the supplied name, and then use that
// Sheet object to retrieve a reference to the first worksheet.
Sheet theSheet = document.WorkbookPart.Workbook.Descendants<Sheet>().Where(s => s.Name == worksheetName).FirstOrDefault();
if (theSheet == null)
throw new Exception("Couldn't find the worksheet: " + worksheetName);
// Retrieve a reference to the worksheet part.
WorksheetPart wsPart = (WorksheetPart)(document.WorkbookPart.GetPartById(theSheet.Id));
Worksheet workSheet = wsPart.Worksheet;
string dimensions = workSheet.SheetDimension.Reference.InnerText; // Get the dimensions of this worksheet, eg "B2:F4"
int numOfColumns = 0;
int numOfRows = 0;
CalculateDataTableSize(dimensions, ref numOfColumns, ref numOfRows);
System.Diagnostics.Trace.WriteLine(string.Format("The worksheet \"{0}\" has dimensions \"{1}\", so we need a DataTable of size {2}x{3}.", worksheetName, dimensions, numOfColumns, numOfRows));
SheetData sheetData = workSheet.GetFirstChild<SheetData>();
IEnumerable<Row> rows = sheetData.Descendants<Row>();
string[,] cellValues = new string[numOfColumns, numOfRows];
int colInx = 0;
int rowInx = 0;
string value = "";
SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;
// Iterate through each row of OpenXML data
foreach (Row row in rows)
{
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
// *DON'T* assume there's going to be one XML element for each item in each row...
Cell cell = row.Descendants<Cell>().ElementAt(i);
if (cell.CellValue == null || cell.CellReference == null)
continue; // eg when an Excel cell contains a blank string
// Convert this Excel cell's CellAddress into a 0-based offset into our array (eg "G13" -> [6, 12])
colInx = GetColumnIndexByName(cell.CellReference); // eg "C" -> 2 (0-based)
rowInx = GetRowIndexFromCellAddress(cell.CellReference)-1; // Needs to be 0-based
// Fetch the value in this cell
value = cell.CellValue.InnerXml;
if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
{
value = stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
}
cellValues[colInx, rowInx] = value;
}
}
// Copy the array of strings into a DataTable
for (int col = 0; col < numOfColumns; col++)
dt.Columns.Add("Column_" + col.ToString());
for (int row = 0; row < numOfRows; row++)
{
DataRow dataRow = dt.NewRow();
for (int col = 0; col < numOfColumns; col++)
{
dataRow.SetField(col, cellValues[col, row]);
}
dt.Rows.Add(dataRow);
}
#if DEBUG
// Write out the contents of our DataTable to the Output window (for debugging)
string str = "";
for (rowInx = 0; rowInx < numOfRows; rowInx++)
{
for (colInx = 0; colInx < numOfColumns; colInx++)
{
object val = dt.Rows[rowInx].ItemArray[colInx];
str += (val == null) ? "" : val.ToString();
str += "\t";
}
str += "\n";
}
System.Diagnostics.Trace.WriteLine(str);
#endif
return dt;
}
}
private static void CalculateDataTableSize(string dimensions, ref int numOfColumns, ref int numOfRows)
{
// How many columns & rows of data does this Worksheet contain ?
// We'll read in the Dimensions string from the Excel file, and calculate the size based on that.
// eg "B1:F4" -> we'll need 6 columns and 4 rows.
//
// (We deliberately ignore the top-left cell address, and just use the bottom-right cell address.)
try
{
string[] parts = dimensions.Split(':'); // eg "B1:F4"
if (parts.Length != 2)
throw new Exception("Couldn't find exactly *two* CellAddresses in the dimension");
numOfColumns = 1 + GetColumnIndexByName(parts[1]); // A=1, B=2, C=3 (1-based value), so F4 would return 6 columns
numOfRows = GetRowIndexFromCellAddress(parts[1]);
}
catch
{
throw new Exception("Could not calculate maximum DataTable size from the worksheet dimension: " + dimensions);
}
}
public static int GetRowIndexFromCellAddress(string cellAddress)
{
// Convert an Excel CellReference into a 1-based row index
// eg "D42" -> 42
// "F123" -> 123
string rowNumber = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^0-9 _]", "");
return int.Parse(rowNumber);
}
public static int GetColumnIndexByName(string cellAddress)
{
// Convert an Excel CellReference column into a 0-based column index
// eg "D42" -> 3
// "F123" -> 5
var columnName = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^A-Z_]", "");
int number = 0, pow = 1;
for (int i = columnName.Length - 1; i >= 0; i--)
{
number += (columnName[i] - 'A' + 1) * pow;
pow *= 26;
}
return number - 1;
}
}
Just to mention, some of our company's Excel Worksheets have one or more blank rows at the top. Strangely, this prevented some other OpenXML libraries from reading in such Worksheets properly.
This code deliberately creates a DataTable with one value for each of the cells in the Worksheet, even the blank ones at the top.
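The code above reads workSheet.SheetDimension.Reference directly, which throws when the worksheet has no dimension element (the element is optional in the file format). A minimal sketch of a fallback, assuming you are willing to scan the cell references once: compute the size from the references themselves. The class and method names here are hypothetical and not part of the OpenXML SDK. In the method above you would presumably feed it something like sheetData.Descendants<Cell>().Select(c => c.CellReference?.Value) whenever SheetDimension is null.

```csharp
using System;
using System.Collections.Generic;

public static class DimensionFallback
{
    // Returns how many columns and rows are needed to hold every referenced cell,
    // measured from A1. For example, { "B2", "F4" } needs 6 columns and 4 rows.
    public static (int Columns, int Rows) FromCellReferences(IEnumerable<string> cellReferences)
    {
        int maxCol = 0, maxRow = 0;
        foreach (string reference in cellReferences)
        {
            if (string.IsNullOrEmpty(reference))
                continue; // a cell without a reference tells us nothing about size

            int col = 0, row = 0;
            foreach (char ch in reference)
            {
                if (ch >= 'A' && ch <= 'Z')
                    col = col * 26 + (ch - 'A' + 1);   // base-26 letters, A=1 ... Z=26, AA=27
                else if (ch >= '0' && ch <= '9')
                    row = row * 10 + (ch - '0');       // decimal row number
            }
            maxCol = Math.Max(maxCol, col);
            maxRow = Math.Max(maxRow, row);
        }
        return (maxCol, maxRow);
    }
}
```

Like CalculateDataTableSize, this ignores the top-left corner and sizes the array from cell A1, so blank rows at the top are still preserved.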

How to read faster in OpenXML format

With OLEDB, it took only 2 to 3 seconds to read 3200 rows from an Excel sheet. After switching to the OpenXML SDK, it takes more than 1 minute to read the same 3200 rows.
Below is my code:
public static DataTable ReadExcelFileDOM(string filename)
{
DataTable table;
using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true))
{
WorkbookPart workbookPart = myDoc.WorkbookPart;
Sheet worksheet = workbookPart.Workbook.Descendants<Sheet>().First();
WorksheetPart worksheetPart =
(WorksheetPart)(workbookPart.GetPartById(worksheet.Id));
SheetData sheetData =
worksheetPart.Worksheet.Elements<SheetData>().First();
List<List<string>> totalRows = new List<List<string>>();
int maxCol = 0;
foreach (Row r in sheetData.Elements<Row>())
{
// Add the empty row.
string value = null;
while (totalRows.Count < r.RowIndex - 1)
{
List<string> emptyRowValues = new List<string>();
for (int i = 0; i < maxCol; i++)
{
emptyRowValues.Add("");
}
totalRows.Add(emptyRowValues);
}
List<string> tempRowValues = new List<string>();
foreach (Cell c in r.Elements<Cell>())
{
#region get the cell value of c.
if (c != null)
{
value = c.InnerText;
// If the cell represents a numeric value, you are done.
// For dates, this code returns the serialized value that
// represents the date. The code handles strings and Booleans
// individually. For shared strings, the code looks up the
// corresponding value in the shared string table. For Booleans,
// the code converts the value into the words TRUE or FALSE.
if (c.DataType != null)
{
switch (c.DataType.Value)
{
case CellValues.SharedString:
// For shared strings, look up the value in the shared
// strings table.
var stringTable = workbookPart.
GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
// If the shared string table is missing, something is
// wrong. Return the index that you found in the cell.
// Otherwise, look up the correct text in the table.
if (stringTable != null)
{
value = stringTable.SharedStringTable.
ElementAt(int.Parse(value)).InnerText;
}
break;
case CellValues.Boolean:
switch (value)
{
case "0":
value = "FALSE";
break;
default:
value = "TRUE";
break;
}
break;
}
}
Console.Write(value + " ");
}
#endregion
// Add the cell to the row list.
int i = Convert.ToInt32(c.CellReference.ToString().ToCharArray().First() - 'A');
// Add the blank cell in the row.
while (tempRowValues.Count < i)
{
tempRowValues.Add("");
}
tempRowValues.Add(value);
}
// add the row to the totalRows.
maxCol = processList(tempRowValues, totalRows, maxCol);
Console.WriteLine();
}
table = ConvertListListStringToDataTable(totalRows, maxCol);
}
return table;
}
/// <summary>
/// Add each row to the totalRows.
/// </summary>
/// <param name="tempRows"></param>
/// <param name="totalRows"></param>
/// <param name="MaxCol">the max column number in rows of the totalRows</param>
/// <returns></returns>
private static int processList(List<string> tempRows, List<List<string>> totalRows, int MaxCol)
{
if (tempRows.Count > MaxCol)
{
MaxCol = tempRows.Count;
}
totalRows.Add(tempRows);
return MaxCol;
}
private static DataTable ConvertListListStringToDataTable(List<List<string>> totalRows, int maxCol)
{
DataTable table = new DataTable();
for (int i = 0; i < maxCol; i++)
{
table.Columns.Add();
}
foreach (List<string> row in totalRows)
{
while (row.Count < maxCol)
{
row.Add("");
}
table.Rows.Add(row.ToArray());
}
return table;
}
Is there an efficient way to change this code so that the read is faster? Thanks.
I tried your code and found that a very simple example took approximately 4 secs to complete.
After editing my .xls file to match the details you gave (columns: regional prefix, city, date, function, ...) and adding about 3,600 rows, your code took about 10 secs.
I think you should remove all Console.Write and Console.WriteLine statements, as they slow down the processing of your xls file. After removing all of them, my StopWatch showed 1.26 secs for the same number of rows.
Some reasons why Console.WriteLine is so slow are discussed on SO in the question "Console.WriteLine slow". In that question there is an answer pointing to OutputDebugString...
I found some disadvantages in your code.
When adding a large number of rows to a DataTable, use BeginLoadData and EndLoadData.
You need to cache the SharedStringTable.
You should use OpenXmlReader (the SAX method). Memory consumption will be reduced.
You can try my ExcelDataReader, which does not have these disadvantages. See here: https://github.com/gSerP1983/OpenXml.Excel.Data
Read to DataTable example:
class Program
{
static void Main(string[] args)
{
var dt = new DataTable();
using (var reader = new ExcelDataReader(@"data.xlsx"))
{
dt.Load(reader);
}
Console.WriteLine("done: " + dt.Rows.Count);
Console.ReadKey();
}
}
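For readers who want to keep their own code, the first two points in the list above can be sketched without any extra library. This is an illustration under stated assumptions, not the linked project's implementation: FastLoadSketch and its methods are hypothetical names, and the sharedStrings array is assumed to have been built once per workbook (for instance from workbookPart.SharedStringTablePart.SharedStringTable.Elements<SharedStringItem>().Select(s => s.InnerText).ToArray()). An array lookup avoids calling ElementAt on the table for every cell, which most likely has to walk the child elements each time.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class FastLoadSketch
{
    // Shared-string cells store an index into the shared string table.
    // sharedStrings is that table copied into an array once, up front.
    public static string ResolveCellText(string rawValue, bool isSharedString, string[] sharedStrings)
    {
        if (!isSharedString)
            return rawValue;

        int index = int.Parse(rawValue);
        // If the index is out of range, fall back to the raw value.
        return index >= 0 && index < sharedStrings.Length ? sharedStrings[index] : rawValue;
    }

    // Bulk-load rows between BeginLoadData and EndLoadData, which suspend
    // notifications, index maintenance and constraint checks while loading.
    public static DataTable ToDataTable(IEnumerable<IList<string>> rows, int columnCount)
    {
        var table = new DataTable();
        for (int i = 0; i < columnCount; i++)
            table.Columns.Add();

        table.BeginLoadData();
        try
        {
            foreach (IList<string> row in rows)
            {
                // Pad short rows with empty strings so every row has columnCount values.
                var values = new object[columnCount];
                for (int i = 0; i < columnCount; i++)
                    values[i] = i < row.Count ? row[i] : "";
                table.LoadDataRow(values, true);
            }
        }
        finally
        {
            table.EndLoadData();
        }
        return table;
    }
}
```

In the question's code, ToDataTable(totalRows, maxCol) could stand in for ConvertListListStringToDataTable, and ResolveCellText for the SharedString branch of the switch.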
