Unexpected results using Excel Data Reader

I'm reading an XLSX (Microsoft Excel XML file) using the Excel Data Reader from http://exceldatareader.codeplex.com/ and I am getting some unexpected results.
The following code outputs data from multiple tabs
var reader = Excel.ExcelReaderFactory.CreateOpenXmlReader(uploadFile.InputStream);
while (reader.Read())
{
System.Diagnostics.Debug.WriteLine(reader.FieldCount );
for (int i = 0; i < reader.FieldCount; i++)
{
System.Diagnostics.Debug.Write(reader[i] + "*");
}
System.Diagnostics.Debug.WriteLine("\n~\n");
}
On a single line, I can get data from 3 or more tabs.
I would expect this to loop through and show all of the contents of the first tab and only the first tab.
What am I missing?
Update: It appears that the above code works fine if there is only 1 tab in the Excel file. This may just be a bug in this library. Has anyone else used this library to parse Excel files with multiple tabs?
Thanks

OK, so my reply is extremely late with reference to this question, but if it's any help, try wrapping your code in a loop over reader.NextResult(). This works the same way as when you step through multiple DataTable objects within a DataSet.
Additionally, this approach has a very small memory footprint compared to the reader.AsDataSet() method, which uses a lot of memory even for workbooks as small as 20 MB.
For example:
var reader = Excel.ExcelReaderFactory.CreateOpenXmlReader(uploadFile.InputStream);
do
{
while (reader.Read())
{
System.Diagnostics.Debug.WriteLine(reader.FieldCount );
for (int i = 0; i < reader.FieldCount; i++)
{
System.Diagnostics.Debug.Write(reader[i] + "*");
}
System.Diagnostics.Debug.WriteLine("\n~\n");
}
}while(reader.NextResult());
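The same loop can collect each tab's rows separately instead of printing them. The following is a minimal sketch, not part of the original answer: it is written against System.Data.IDataReader, which IExcelDataReader extends, so it can be tried without a workbook. The class and method names (SheetReading, ReadAllResults) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class SheetReading
{
    // Returns one list of rows per result set (one per worksheet when the
    // reader comes from ExcelDataReader). Values are copied with GetValue(i)
    // so the sketch relies only on members used in the answer above.
    public static List<List<object[]>> ReadAllResults(IDataReader reader)
    {
        var sheets = new List<List<object[]>>();
        do
        {
            var rows = new List<object[]>();
            while (reader.Read())
            {
                var values = new object[reader.FieldCount];
                for (int i = 0; i < reader.FieldCount; i++)
                {
                    values[i] = reader.GetValue(i);
                }
                rows.Add(values);
            }
            sheets.Add(rows);
        } while (reader.NextResult()); // advance to the next tab, if any
        return sheets;
    }
}
```

To read only the first tab, take the first element of the returned list, or drop the outer do/while and keep just the inner while loop. Note that collecting every row gives up the small memory footprint mentioned above, so for very large workbooks it is better to process each row inside the loop.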

This is why I am using NPOI. I have tried several other Excel readers, and this is the one that actually worked for me.

Related

Database (Excel) Access Speed - Using Open XML SDK in Visual Studio C# (DOM Approach)

I mostly write number-crunching programs using Visual Studio C# (2019) where I am simply taking input data, calculating results and displaying them. No complicated network or Internet programming. Think of a first- or second-level college programming course from the early 1990s.
For inputs I was reading in data from an excel file using the following directive:
using Excel = Microsoft.Office.Interop.Excel;
This proved to be very slow when executing the program. I then learned this way of accessing an Excel file is no longer supported and has been superseded by Open XML SDK. Please see the following link to the Microsoft Dev Center page:
https://learn.microsoft.com/en-us/office/open-xml/how-to-parse-and-read-a-large-spreadsheet
For what I want to do, the Document Object Model (DOM) approach seems most appropriate for the thousands of individual Excel cells I want to read as input data. However, the Microsoft Dev Center is certainly not the most user-friendly resource, and the code example provided for reading an Excel file using this DOM approach writes to a console, which I'm not using. I never did get my code to work.
Long and short of it is, I got my code working using the GetCellValue Method:
https://learn.microsoft.com/en-us/office/open-xml/how-to-retrieve-the-values-of-cells-in-a-spreadsheet
However, this 'GetCellValue' method is still taking way too long. I need to read in thousands or tens of thousands of Excel input data cells in seconds or fractions of seconds not 20 seconds to a minute.
I think if I had an example of the DOM method reading in Excel data to an Array Variable (instead of writing to the console) it would help. Can anyone provide an example of such code?
Below I have included my code example where I modified the DOM approach code copied from the Microsoft Office Dev Center to write values from a source Excel File to a DataGrid instead of the Console used by the Dev Center code:
// The DOM approach.
// Note that the code below works only for cells that contain numeric values.
//
public void ReadExcelFileDOM(string fileName)
{
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(fileName, false))
{
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
DataGridView_Vessel.Rows.Clear();
DataGridView_Vessel.Refresh();
string text;
int File_Row = 0;
int File_Cell = 0;
foreach (Row r in sheetData.Elements<Row>())
{
File_Cell = 0; // restart the column index for each row
DataGridView_Vessel.Rows.Add();
foreach (Cell c in r.Elements<Cell>())
{
if (c.CellValue == null)
{
File_Cell++;
//continue;
}
else
{
text = c.CellValue.Text;
if(File_Cell<12)
{
DataGridView_Vessel.Rows[File_Row].Cells[File_Cell].Value = text;
}
File_Cell++;
}
}
File_Row++;
}
//Console.WriteLine();
//Console.ReadKey();
}
}
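One way to get the values into an array rather than a grid control is to place each cell by its reference (such as "C5") instead of a running counter, because Elements<Cell>() only yields cells that are actually stored in the sheet, so a counter can drift when a row has gaps. The sketch below is an illustration under assumptions, not the Dev Center code: the names CellGrid, ParseReference and ToGrid are hypothetical, and the helper works on plain (reference, text) pairs so it can run without the Open XML SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CellGrid
{
    // Converts an A1-style reference such as "AB12" into zero-based
    // (row, column) indexes: "A1" -> (0, 0), "AB12" -> (11, 27).
    public static (int Row, int Column) ParseReference(string cellReference)
    {
        int column = 0;
        int i = 0;
        while (i < cellReference.Length && char.IsLetter(cellReference[i]))
        {
            column = column * 26 + (char.ToUpperInvariant(cellReference[i]) - 'A' + 1);
            i++;
        }
        int row = int.Parse(cellReference.Substring(i), CultureInfo.InvariantCulture);
        return (row - 1, column - 1);
    }

    // Fills a rowCount x columnCount array from (reference, text) pairs.
    // Cells outside the requested size are ignored; missing cells stay null.
    public static string[,] ToGrid(
        IEnumerable<(string Reference, string Text)> cells, int rowCount, int columnCount)
    {
        var grid = new string[rowCount, columnCount];
        foreach (var (reference, text) in cells)
        {
            var (row, column) = ParseReference(reference);
            if (row < rowCount && column < columnCount)
            {
                grid[row, column] = text;
            }
        }
        return grid;
    }
}
```

With the Open XML SDK, the pairs would come from c.CellReference.Value and c.CellValue.Text inside the nested foreach loops above. As in the original code, this only gives meaningful text for numeric cells.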

How to process extremely large .xlsx files with C#

Situation I need to solve:
My client has some extremely large .xlsx files that resemble a database table (each row is a record, cols are fields)
I need to help them process those files (search, filter, etc).
By large I mean the smallest of them has 1 million records.
What I have tried:
SheetJS, and NPOI: both libs only reply with a simple "file too large".
EPPlus: can read files of up to a few hundred thousand records, but when faced with the actual file it just gives me a System.OverflowException. My guess is that it is basically out of memory, because a 200MB xlsx file already took 4GB of memory to read.
I didn't try Microsoft OleDB, but I'd rather avoid it, since I don't want to purchase Microsoft Office just for a job.
Due to confidentiality I cannot share the actual file, but you can easily create a similar structure with 60 cols (first name, last name, dob, etc), and about 1M records.
The question would be solved as soon as you can read an .xlsx file with that criteria, remove half of the records then write to another place without facing memory issue.
Time is not too much of an issue. User is willing to wait an hour or 2 for result if needed.
Memory seems to be the issue currently. This is a personal request, and the client's machine is a laptop capped at 8GB RAM.
csv is not an option here. My client has .xlsx input and needs .xlsx output.
Language choice is preferably JS, C# or Python, since I already know how to create executables with them (well, we can't tell an accountant to learn the terminal, can we?).
It would be great if there is a way to slowly read small chunks of data from the file row-by-row, but solutions I have found only read the entire file at the same time.

For reading Excel files I would recommend ExcelDataReader. It handles large files well. I have personally tried it with 500k to 1M rows:
using (var stream = File.Open("C:\\temp\\input.xlsx", FileMode.Open, FileAccess.Read))
{
using (var reader = ExcelReaderFactory.CreateReader(stream))
{
while (reader.Read())
{
for (var i = 0; i < reader.FieldCount; i++)
{
var value = reader.GetValue(i)?.ToString();
}
}
}
}
Writing data back in the same efficient way is trickier. I ended up creating my own SwiftExcel library, which is extremely fast and efficient (there is a performance chart comparing it to other NuGet libraries, including EPPlus) because it does not use any XML serialization and writes data directly to the file:
using (var ew = new ExcelWriter("C:\\temp\\test.xlsx"))
{
for (var row = 1; row <= 100; row++)
{
for (var col = 1; col <= 10; col++)
{
ew.Write($"row:{row}-col:{col}", col, row);
}
}
}

Interop Excel is slow

I am writing an application to open an Excel sheet and read it
MyApp = new Excel.Application();
MyBook = MyApp.Workbooks.Open(filename);
MySheet = (Excel.Worksheet)MyBook.Sheets[1]; // Explicit cast is not required here
lastRow = MySheet.Cells.SpecialCells(Excel.XlCellType.xlCellTypeLastCell).Row;
MyApp.Visible = false;
It takes about 6-7 seconds for this to take place. Is this normal with interop Excel?
Also, is there a quicker way to read an Excel file than this?
string[] xx = new string[lastRow];
for (int index = 1; index <= lastRow; index++)
{
int maxCol = endCol - startCol;
for (int j = 1; j <= maxCol; j++)
{
try
{
xx[index - 1] += (MySheet.Cells[index, j] as Excel.Range).Value2.ToString();
}
catch
{
}
if (j != maxCol) xx[index - 1] += "|";
}
}
MyApp.Quit();
System.Runtime.InteropServices.Marshal.ReleaseComObject(MySheet);
System.Runtime.InteropServices.Marshal.ReleaseComObject(MyBook);
System.Runtime.InteropServices.Marshal.ReleaseComObject(MyApp);

Adding to the answer of @RvdK: yes, COM interop is slow.
Why is it slow?
It is slow because of how it works. Every call made from .NET must be marshaled to a local COM proxy. From there it is marshaled from one process (your app) to the COM server (Excel) through IPC inside the Windows kernel. It is then dispatched from the server's local proxy into native code, where the arguments are marshaled from OLE Automation compatible types into native types, their validity is checked, and the function is performed. The result of the function travels back in approximately the same way, through several layers between two different processes.
So each and every command is quite expensive to execute; the more of them you issue, the slower the whole process becomes. You can find lots of documentation around the web, as COM is an old and well-functioning standard (though it has been dying out along with Visual Basic 6).
One example of such article is here: http://www.codeproject.com/Articles/990/Understanding-Classic-COM-Interoperability-With-NE
Is there a quicker way to read?
ClosedXML can both read and write Excel xlsx files (even formulas, formatting and stuff) using Microsoft's OpenXml SDK, see here: https://closedxml.codeplex.com/wikipage?title=Finding%20and%20extracting%20the%20data&referringTitle=Documentation
Excel data reader claims to be able to read both legacy and new Excel data files, I did not try it myself, take a look here: https://exceldatareader.codeplex.com/
Another way to read data faster is to use Excel automation to convert the sheet into a data file that you can parse easily and batch process without the interop layer (e.g. XML or CSV). This answer shows how to do it

Short answer: correct, interop is slow. (I had the same problem; it took a couple of seconds to read 300 lines.)
Use a library for this:
http://epplus.codeplex.com/
http://npoi.codeplex.com/

This answer is only about the second part of your question.
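You are using lots of single-cell ranges there, which is not how the API is intended to be used and is indeed very slow.
First read the complete range in one call and then iterate over the result, like so:
// One COM call fetches the whole block; Value2 returns a 1-based object[,].
object[,] xx = (object[,])MySheet.Range["A1", "XX100"].Value2;
for (int i = xx.GetLowerBound(0); i <= xx.GetUpperBound(0); i++)
{
    for (int j = xx.GetLowerBound(1); j <= xx.GetUpperBound(1); j++)
    {
        Console.WriteLine(xx[i, j]); // empty cells come back as null
    }
}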
This will be much faster!
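To rebuild the pipe-delimited strings from the question out of such a block, the array can be walked entirely in managed code with no further COM calls. This is a minimal sketch under assumptions: the names RangeValues and ToPipeDelimitedRows are hypothetical, and the input is expected to be the two-dimensional array that Value2 returns for a multi-cell range (for a single-cell range Value2 returns a scalar instead, which this helper does not handle).

```csharp
using System;
using System.Text;

public static class RangeValues
{
    // Joins each row of a two-dimensional value block with '|'.
    // Uses the array's own bounds, so it works for the 1-based arrays
    // that Excel interop returns as well as ordinary 0-based arrays.
    public static string[] ToPipeDelimitedRows(object[,] values)
    {
        int firstRow = values.GetLowerBound(0), lastRow = values.GetUpperBound(0);
        int firstCol = values.GetLowerBound(1), lastCol = values.GetUpperBound(1);
        var lines = new string[lastRow - firstRow + 1];
        var sb = new StringBuilder();
        for (int r = firstRow; r <= lastRow; r++)
        {
            sb.Clear();
            for (int c = firstCol; c <= lastCol; c++)
            {
                if (c > firstCol) sb.Append('|');
                sb.Append(values[r, c]); // a null (empty cell) appends nothing
            }
            lines[r - firstRow] = sb.ToString();
        }
        return lines;
    }
}
```

Compared with the loop in the question, this avoids both the per-cell interop call and the empty catch block, since empty cells are handled as nulls rather than exceptions.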

You can use this free library, which supports both xls and xlsx:
Workbook wb = new Workbook();
wb.LoadFromFile(ofd.FileName);
https://freenetexcel.codeplex.com/

Epplus, Out of memory

Sorry for my English. I use the EPPlus library and I really like it, but I've run into a problem: out of memory. I need to write large amounts of data, whatever it takes. Is it possible to append to the end of an Excel file without holding the whole file in memory? Or to create multiple files and then concatenate them into one file? Thanks in advance.

1) If you retrieve your data from a database, use a DataReader instead of a DataTable.
2) Write the Excel file to a temp file and delete it when done (in a web environment, use Response.WriteFile and then delete it).
3) Write the header first, then append data to it.
Something like this (I'm typing this on my phone):
var pck = new ExcelPackage();
var ws = pck.Workbook.Worksheets.Add("sheet1");
// write header here
pck.SaveAs(fileInfo);
pck.Dispose();
pck = new ExcelPackage(fileInfo); // re-open the file that was just saved
ws = pck.Workbook.Worksheets["sheet1"];
var rowIndex = 0;
while (reader.Read())
{
    if (++rowIndex % 100000 == 0)
    {
        // save and re-open
    }
    // write row here
}
pck.Save();
// dispose / send file / delete file etc.

Saving Excel 2007 documents

In .NET C# I'm trying to open an Excel template, add some data and save it as a new document. I'm trying to use the OpenXML document format. I can't seem to find any guidance on how to do this. Seems like all the documentation talks about how to write various parts to the Package but I can't find anything on what to do when you're done and want to save it.
Anyone know where I can find this information? I must be thinking about this incorrectly because I'm not finding anything useful on what seems to be very basic.
Thanks

ExcelPackage works pretty well for that. I don't think the primary author has worked on it for a little while, but it has a good following of people on its forum who work out any issues.
// Assumes strFileName, WorkSheetName, startRow and the DataTable dt come from the surrounding code.
FileInfo template = new FileInfo(Path.Combine(Path.GetDirectoryName(Application.ExecutablePath), "Template.xlsx"));
FileInfo newFile = new FileInfo(strFileName); // assumption: strFileName is the output path as a string
using (ExcelPackage xlPackage = new ExcelPackage(newFile, template))
{
    // Enable DEBUG mode to create the xl folder (equivalent to expanding a xlsx.zip file)
    //xlPackage.DebugMode = true;
    ExcelWorksheet worksheet = xlPackage.Workbook.Worksheets["Sheet1"];
    worksheet.Name = WorkSheetName;
    int r = startRow;
    foreach (DataRow row in dt.Rows)
    {
        int c = 1;
        if (r > startRow) worksheet.InsertRow(r);
        // our query has the columns in the right order, so simply
        // iterate through the columns
        foreach (DataColumn col in dt.Columns)
        {
            if (row[col] != DBNull.Value)
            {
                worksheet.Cell(r, c).Value = row[col].ToString();
                worksheet.Column(c).Width = 10;
            }
            c++;
        }
        r++;
    }
    // keep the sheet in normal view rather than page layout view
    worksheet.View.PageLayoutView = false;
    // save our new workbook and we are done!
    xlPackage.Save();
} // the using block disposes the package

Accessing Open XML / SpreadsheetML documents is far from a trivial exercise. The specification is large and complex. The "Open XML SDK" (google it) definitely helps, but still requires some knowledge of the Open XML standard to get much done.
SpreadsheetGear for .NET has an API similar to Excel and can read and write Excel Open XML (xlsx) documents as well as Excel 97-2003 (xls) documents.
You can see some SpreadsheetGear samples here and download a free trial here.
Disclaimer: I own SpreadsheetGear LLC
