Interop Excel is slow


I am writing an application that opens an Excel sheet and reads it:
MyApp = new Excel.Application();
MyBook = MyApp.Workbooks.Open(filename);
MySheet = (Excel.Worksheet)MyBook.Sheets[1]; // Explicit cast is not required here
lastRow = MySheet.Cells.SpecialCells(Excel.XlCellType.xlCellTypeLastCell).Row;
MyApp.Visible = false;
This takes about 6-7 seconds. Is that normal with Excel interop?
Also, is there a quicker way to read an Excel file than this?
string[] xx = new string[lastRow];
for (int index = 1; index <= lastRow; index++)
{
    int maxCol = endCol - startCol;
    for (int j = 1; j <= maxCol; j++)
    {
        try
        {
            xx[index - 1] += (MySheet.Cells[index, j] as Excel.Range).Value2.ToString();
        }
        catch
        {
        }
        if (j != maxCol) xx[index - 1] += "|";
    }
}
MyApp.Quit();
System.Runtime.InteropServices.Marshal.ReleaseComObject(MySheet);
System.Runtime.InteropServices.Marshal.ReleaseComObject(MyBook);
System.Runtime.InteropServices.Marshal.ReleaseComObject(MyApp);

Adding to @RvdK's answer: yes, COM interop is slow.
Why is it slow?
It is slow because of how it works. Every call made from .NET must be marshaled to a local COM proxy, then marshaled from one process (your app) to the COM server (Excel) through IPC inside the Windows kernel. There it is dispatched from the server's local proxy into native code, where the arguments are marshaled from OLE Automation compatible types into native types, their validity is checked, and the function is performed. The result of the function travels back approximately the same way, through several layers between two different processes.
So each and every command is quite expensive to execute, and the more of them you issue, the slower the whole process becomes. You can find lots of documentation around the web, as COM is an old and well-established standard (somewhat dying along with Visual Basic 6).
One example of such an article is here: http://www.codeproject.com/Articles/990/Understanding-Classic-COM-Interoperability-With-NE
Is there a quicker way to read?
ClosedXML can both read and write Excel xlsx files (including formulas, formatting and so on) using Microsoft's OpenXml SDK, see here: https://closedxml.codeplex.com/wikipage?title=Finding%20and%20extracting%20the%20data&referringTitle=Documentation
Excel Data Reader claims to be able to read both legacy and new Excel data files. I have not tried it myself; take a look here: https://exceldatareader.codeplex.com/
Another way to read data faster is to use Excel automation to convert the sheet into a data file that you can parse easily and batch process without the interop layer (e.g. XML, CSV). This answer shows how to do it

Short answer: correct, interop is slow. (I had the same problem: it took a couple of seconds to read 300 lines.)
Use a library for this:
http://epplus.codeplex.com/
http://npoi.codeplex.com/

This answer is only about the second part of your question.
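You are using lots of ranges there, which is not how interop is intended to be used and is indeed very slow.
First read the complete range, then iterate over the result, like so:
object[,] xx = (object[,])MySheet.Range["A1", "XX100"].Value2;
// Arrays returned by Excel are 1-based in both dimensions.
for (int i = xx.GetLowerBound(0); i <= xx.GetUpperBound(0); i++)
{
    for (int j = xx.GetLowerBound(1); j <= xx.GetUpperBound(1); j++)
    {
        // Empty cells come back as null; Convert.ToString turns them into "".
        Console.WriteLine(Convert.ToString(xx[i, j]));
    }
}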
This will be much faster!
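As a minimal sketch of what to do with the block once it has been read, the helper below builds the pipe-delimited row strings from the question. `RowJoiner`, `JoinRows` and `DemoBlock` are hypothetical names, not part of any library. The only assumption about Excel is that a multi-cell `Value2` comes back as a 1-based `object[,]` with `null` for empty cells; the demo array imitates that so the code runs without Excel.

```csharp
using System;
using System.Text;

public static class RowJoiner
{
    // Builds one "a|b|c" string per row from a 2D block of cell values.
    // Works for any lower bound, so it accepts the 1-based arrays that
    // Excel interop returns as well as ordinary 0-based arrays.
    public static string[] JoinRows(object[,] values)
    {
        int firstRow = values.GetLowerBound(0), lastRow = values.GetUpperBound(0);
        int firstCol = values.GetLowerBound(1), lastCol = values.GetUpperBound(1);
        var result = new string[lastRow - firstRow + 1];

        for (int r = firstRow; r <= lastRow; r++)
        {
            var line = new StringBuilder();
            for (int c = firstCol; c <= lastCol; c++)
            {
                if (c > firstCol) line.Append('|');
                // Convert.ToString yields "" for null (empty cells).
                line.Append(Convert.ToString(values[r, c]));
            }
            result[r - firstRow] = line.ToString();
        }
        return result;
    }

    // Creates a small 1-based array that mimics what Value2 returns.
    public static object[,] DemoBlock()
    {
        var block = (object[,])Array.CreateInstance(
            typeof(object), new[] { 2, 3 }, new[] { 1, 1 });
        block[1, 1] = "Ann"; block[1, 2] = 42.0; block[1, 3] = null;
        block[2, 1] = "Bob"; block[2, 2] = 7.0;  block[2, 3] = "x";
        return block;
    }
}
```

With interop, the input would be the array fetched in a single call, for example `(object[,])MySheet.Range["A1", "XX100"].Value2`, so the nested loop runs entirely in managed memory.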

You can use this free library, which supports xls and xlsx:
Workbook wb = new Workbook();
wb.LoadFromFile(ofd.FileName);
https://freenetexcel.codeplex.com/

Related

How to process extremely large .xlsx files with C#

Situation I need to solve:
My client has some extremely large .xlsx files that resemble a database table (each row is a record, cols are fields)
I need to help them process those files (search, filter, etc).
By large I mean the smallest of them has 1 million records.
What I have tried:
SheetJS and NPOI: both libs only reply with a simple "file too large".
EPPlus: can read files up to a few hundred thousand records, but when faced with the actual file it just gives me a System.OverflowException. My guess is that it's basically out of memory, because a 200MB xlsx file already took 4GB of memory to read.
I didn't try Microsoft OleDB, but I'd rather avoid it, since I don't want to purchase Microsoft Office just for a job.
Due to confidentiality I cannot share the actual file, but you can easily create a similar structure with 60 cols (first name, last name, dob, etc), and about 1M records.
The question would be solved as soon as you can read an .xlsx file with that criteria, remove half of the records then write to another place without facing memory issue.
Time is not too much of an issue. User is willing to wait an hour or 2 for result if needed.
Memory seems to be the issue currently. This is a personal request, and the client's machine is a laptop capped at 8GB RAM.
CSV is not an option here. My client has .xlsx input and needs .xlsx output.
Language choice is preferably JS, C# or Python, since I already know how to create executables with them (well, we can't tell an accountant to learn the terminal, can we?).
It would be great if there is a way to slowly read small chunks of data from the file row-by-row, but solutions I have found only read the entire file at the same time.

For reading Excel files I would recommend ExcelDataReader. It handles large files very well; I personally tried it with 500k-1M rows:
using (var stream = File.Open("C:\\temp\\input.xlsx", FileMode.Open, FileAccess.Read))
{
    using (var reader = ExcelReaderFactory.CreateReader(stream))
    {
        while (reader.Read())
        {
            for (var i = 0; i < reader.FieldCount; i++)
            {
                var value = reader.GetValue(i)?.ToString();
            }
        }
    }
}
Writing data back in the same efficient way is trickier. I ended up creating my own SwiftExcel library, which is extremely fast and efficient (there is a performance chart comparing it to other NuGet libraries, including EPPlus) because it does not use any XML serialization and writes data directly to the file:
using (var ew = new ExcelWriter("C:\\temp\\test.xlsx"))
{
    for (var row = 1; row <= 100; row++)
    {
        for (var col = 1; col <= 10; col++)
        {
            ew.Write($"row:{row}-col:{col}", col, row);
        }
    }
}

Creating Excel worksheets using Parallel.For and EPPlus

I am using the EPPlus library to create an Excel workbook with many worksheets. I was wondering if it is safe to build the worksheets in parallel. I could not find a mention in the (limited) documentation if the library supports this kind of behavior:
package = new ExcelPackage();
int start = 1;
int end = 100;
Parallel.For(start, end, s =>
{
    var worksheet = package.Workbook.Worksheets.Add("Worksheet" + s.ToString());
    //routine to populate data here
});

Take a quick look at the source code: https://github.com/JanKallman/EPPlus/blob/master/EPPlus/ExcelWorksheets.cs
As you can see, the Add method calls the AddSheet method, which uses lock(...) to make the operation thread-safe, so yes, you can use Parallel.For.

Unexpected results using Excel Data Reader

I'm reading an XLSX (Microsoft Excel XML file) using the Excel Data Reader from http://exceldatareader.codeplex.com/ and I am getting some unexpected results.
The following code outputs data from multiple tabs
var reader = Excel.ExcelReaderFactory.CreateOpenXmlReader(uploadFile.InputStream);
while (reader.Read())
{
    System.Diagnostics.Debug.WriteLine(reader.FieldCount);
    for (int i = 0; i < reader.FieldCount; i++)
    {
        System.Diagnostics.Debug.Write(reader[i] + "*");
    }
    System.Diagnostics.Debug.WriteLine("\n~\n");
}
On a single line, I can get data from 3 or more tabs.
I would expect this to loop through and show all of the contents of the first tab and only the first tab.
What am I missing?
Update: It appears that the above code does work fine if there is only 1 tab in the excel file. This may just be a bug with this library. Has anyone else used this library to parse excel files with multiple tabs?
Thanks

OK, so my reply is extremely late with reference to this question, but if it's any help, try wrapping your code in a do/while loop on reader.NextResult(). This works the same way as when you iterate over multiple DataTable objects within a DataSet.
Additionally, this approach has a very small memory footprint compared with the reader.AsDataSet() method, which hogs a lot of memory even for workbooks as small as 20MB.
For example:
var reader = Excel.ExcelReaderFactory.CreateOpenXmlReader(uploadFile.InputStream);
do
{
    while (reader.Read())
    {
        System.Diagnostics.Debug.WriteLine(reader.FieldCount);
        for (int i = 0; i < reader.FieldCount; i++)
        {
            System.Diagnostics.Debug.Write(reader[i] + "*");
        }
        System.Diagnostics.Debug.WriteLine("\n~\n");
    }
} while (reader.NextResult());

Which is why I am using NPOI. I have tried several other Excel readers, this one actually worked for me.

What's the easiest way to create an Excel table with C#?

I have some tabular data that I'd like to turn into an Excel table.
Software available:
.NET 4 (C#)
Excel 2010 (using the Excel API is OK)
I prefer not to use any 3rd party libraries
Information about the data:
A couple million rows
5 columns, all strings (very simple and regular table structure)
In my script I'm currently using a nested List data structure but I can change that
Performance of the script is not critical
Searching online gives many results, and I'm confused whether I should use OleDb, ADO RecordSets, or something else. Some of these technologies seem like overkill for my scenario, and some seem like they might be obsolete.
What is the very simplest way to do this?
Edit: this is a one-time script I intend to run from my attended desktop.

Avoid using COM interop at all costs. Use a third-party API. Really. In fact, if you're doing this server-side, you virtually have to. There are plenty of free options. I highly recommend using EPPlus, but there are also enterprise-level solutions available. I've used EPPlus a fair amount, and it works great. Unlike interop, it allows you to generate Excel files without requiring Excel to be installed on the machine, which means you also don't have to worry about COM objects sticking around as background processes. Even with proper object disposal, the Excel processes don't always end.
http://epplus.codeplex.com/releases/view/42439
I know you said you want to avoid third-party libraries, but they really are the way to go. Microsoft does not recommend automating Office. It's really not meant to be automated anyway.
http://support.microsoft.com/kb/257757
However, you may want to reconsider inserting "a couple million rows" into a single spreadsheet.

Honoring your request to avoid 3rd party tools and to use COM objects, here's how I'd do it.
Add a reference to the project: COM object
Microsoft Excel 11.0.
At the top of the module add:
using System;
using System.Data.SqlClient;
using System.Reflection;
using Microsoft.Office.Interop.Excel;
Add event logic like this:
private void DoThatExcelThing()
{
    Application myExcel;
    try
    {
        // Attach to a running Excel instance if there is one
        myExcel = (Application)System.Runtime.InteropServices.Marshal.GetActiveObject("Excel.Application");
    }
    catch (Exception)
    {
        // Otherwise start a new instance
        myExcel = new Application();
    }
    myExcel.Visible = true;
    Workbook wb1 = myExcel.Workbooks.Add("");
    Worksheet ws1 = (Worksheet)wb1.Worksheets[1];
    //Read the connection string from App.Config
    string strConn = System.Configuration.ConfigurationManager.ConnectionStrings["NewConnString"].ConnectionString;
    //Open a connection to the database
    SqlConnection myConn = new SqlConnection();
    myConn.ConnectionString = strConn;
    myConn.Open();
    //Establish the query
    SqlCommand myCmd = new SqlCommand("select * from employees", myConn);
    SqlDataReader myRdr = myCmd.ExecuteReader();
    //Read the data and put into the spreadsheet.
    int j = 3;
    while (myRdr.Read())
    {
        for (int i = 0; i < myRdr.FieldCount; i++)
        {
            ws1.Cells[j, i + 1] = myRdr[i].ToString();
        }
        j++;
    }
    //Populate the column names
    for (int i = 0; i < myRdr.FieldCount; i++)
    {
        ws1.Cells[2, i + 1] = myRdr.GetName(i);
    }
    myRdr.Close();
    myConn.Close();
    //Add some formatting
    Range rng1 = ws1.get_Range("A1", "H1");
    rng1.Font.Bold = true;
    rng1.Font.ColorIndex = 3;
    rng1.HorizontalAlignment = XlHAlign.xlHAlignCenter;
    Range rng2 = ws1.get_Range("A2", "H50");
    rng2.WrapText = false;
    rng2.EntireColumn.AutoFit();
    //Add a header row
    ws1.get_Range("A1", "H1").EntireRow.Insert(XlInsertShiftDirection.xlShiftDown, Missing.Value);
    ws1.Cells[1, 1] = "Employee Contact List";
    Range rng3 = ws1.get_Range("A1", "H1");
    rng3.Merge(Missing.Value);
    rng3.Font.Size = 16;
    rng3.Font.ColorIndex = 3;
    rng3.Font.Underline = true;
    rng3.Font.Bold = true;
    rng3.VerticalAlignment = XlVAlign.xlVAlignCenter;
    //Save and close
    string strFileName = String.Format("Employees{0}.xlsx", DateTime.Now.ToString("HHmmss"));
    System.IO.File.Delete(strFileName);
    wb1.SaveAs(strFileName, XlFileFormat.xlWorkbookDefault, Missing.Value, Missing.Value, Missing.Value, Missing.Value,
        XlSaveAsAccessMode.xlExclusive, Missing.Value, false, Missing.Value, Missing.Value, Missing.Value);
    myExcel.Quit();
}

Some things for your consideration...
If this is a client-side solution, there is nothing wrong with using interop.
If this is a server-side solution, don't use interop. A good alternative is the OpenXML SDK from Microsoft if you don't want a 3rd party solution. It's free. I believe the latest version has an object model similar to Excel's. It's a lot faster, A LOT, at generating the workbook than going the interop way, which can bog down your server.

I once read that the easiest way to create an Excel table was to actually write an HTML table, including its structure and data, and simply name the file .xls.
Excel will be able to convert it, but it will display a warning saying that the content does not match the extension.
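A rough sketch of that trick, assuming the data is already available as rows of strings. `HtmlTableExport`, `BuildTable` and `Save` are made-up names for illustration; the only library call beyond basic I/O is `System.Net.WebUtility.HtmlEncode`, used so that characters such as `<` or `&` in the data do not break the markup.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

public static class HtmlTableExport
{
    // Renders rows as a plain HTML table; every cell is HTML-encoded.
    public static string BuildTable(IEnumerable<string[]> rows)
    {
        var html = new StringBuilder("<table>");
        foreach (string[] row in rows)
        {
            html.Append("<tr>");
            foreach (string cell in row)
            {
                html.Append("<td>").Append(WebUtility.HtmlEncode(cell)).Append("</td>");
            }
            html.Append("</tr>");
        }
        return html.Append("</table>").ToString();
    }

    // Writes the table to disk; the caller chooses a name ending in .xls.
    public static void Save(string path, IEnumerable<string[]> rows)
    {
        File.WriteAllText(path, BuildTable(rows), Encoding.UTF8);
    }
}
```

Calling `HtmlTableExport.Save("table.xls", rows)` would produce the file; the extension warning described above still appears when Excel opens it.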

I agree that a 3rd party dll would be cleaner than the com, but if you go the interop route...
Hands down the best way to populate an Excel sheet is to first put the data in a 2-dimensional string array, then get an Excel range object with the same dimensions and assign the array to it (range.Value2 = array, I think). Using any other method is hideously slow.
Also be sure you use the appropriate cleanup code in your finally block.
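To illustrate the array step, here is a small sketch that packs a list of rows into one rectangular block. `GridBuilder` and `ToGrid` are hypothetical names. The interop assignment is shown only as a comment, because it needs Excel and an open worksheet (`ws` is assumed to be a `Worksheet`), so treat that part as an untested outline.

```csharp
using System;
using System.Collections.Generic;

public static class GridBuilder
{
    // Copies a list of rows into one rectangular object[,] block.
    // Short rows are padded with null, which Excel shows as empty cells.
    public static object[,] ToGrid(IList<string[]> rows)
    {
        int width = 0;
        foreach (string[] row in rows) width = Math.Max(width, row.Length);

        var grid = new object[rows.Count, width];
        for (int r = 0; r < rows.Count; r++)
        {
            for (int c = 0; c < rows[r].Length; c++)
            {
                grid[r, c] = rows[r][c];
            }
        }
        return grid;
    }
}

// Assumed interop usage (requires Excel; ws is an open Worksheet):
//   object[,] grid = GridBuilder.ToGrid(rows);
//   Range target = ws.Range[ws.Cells[1, 1],
//                           ws.Cells[grid.GetLength(0), grid.GetLength(1)]];
//   target.Value2 = grid;   // one COM call for the whole block
```

The point of the design is that the per-cell loop stays in managed code and only a single call crosses the process boundary.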

I implemented "export to Excel" with the MS Access OLE DB driver, which can also read and write Excel files, in the following way:
Preparation (done once)
Create an Excel file that contains everything (header, formatting, formulas, diagrams) with an empty data area, as a template to be filled
Give the data area (including the headers) a name (e.g. "MyData")
Implementing export
Copy the template file to the destination folder
Open an OLE DB database connection to the destination file
Use SQL to insert data
Example
Excel table with named area "MyData"
Name, FamilyName, Birthday
Open a System.Data.OleDb.OleDbConnection
Execute the SQL "Insert into MyData(Name, FamilyName, Birthday) values(...)"
I used this connection string
private const string FORMAT_EXCEL_CONNECT =
// @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties=""Excel 8.0;HDR={1}""";
@"Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=""Excel 12.0;HDR={1}""";
private static string GetExcelConnectionString(string excelFilePath, bool header)
{
return string.Format(FORMAT_EXCEL_CONNECT,
excelFilePath,
(header) ? "Yes" : "No"
);
}
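The insert step could look roughly like the sketch below. It assumes the ACE OLE DB provider is installed, that the connection string was built with HDR=Yes so the header cells act as column names, and that the template contains the named area "MyData" with the three columns from the example. `ExcelOleDbExport` and `InsertPerson` are illustrative names, and the code has not been run against a real workbook.

```csharp
using System;
using System.Data.OleDb;

public static class ExcelOleDbExport
{
    // Appends one record to the named area "MyData" in the copied template.
    // connectionString is expected to come from GetExcelConnectionString.
    public static void InsertPerson(string connectionString,
        string name, string familyName, DateTime birthday)
    {
        const string sql =
            "INSERT INTO MyData (Name, FamilyName, Birthday) VALUES (?, ?, ?)";

        using (var connection = new OleDbConnection(connectionString))
        using (var command = new OleDbCommand(sql, connection))
        {
            // OleDb parameters are positional: they are bound in the
            // order they are added, and the names are ignored.
            command.Parameters.AddWithValue("@name", name);
            command.Parameters.AddWithValue("@familyName", familyName);
            command.Parameters.Add("@birthday", OleDbType.Date).Value = birthday;

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
```

Using parameters instead of concatenating values into the SQL text avoids quoting problems with names that contain apostrophes.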

Slow Performance When Reading Excel

I want to read an Excel file, but this way is too slow. What pattern should I use to read an Excel file faster? Should I try CSV?
I am using the following code:
ApplicationClass excelApp = new ApplicationClass();
Workbook myWorkBook = excelApp.Workbooks.Open(@"C:\Users\OWNER\Desktop\Employees.xlsx");
Worksheet mySheet = (Worksheet)myWorkBook.Sheets["Sheet1"];
for (int row = 1; row <= mySheet.UsedRange.Rows.Count; row++)
{
    for (int col = 1; col <= mySheet.UsedRange.Columns.Count; col++)
    {
        Range dataRange = (Range)mySheet.Cells[row, col];
        Console.Write(String.Format(dataRange.Value2.ToString() + " "));
    }
    Console.WriteLine();
}
excelApp.Quit();

The reason your program is slow is that you are using Excel to open your Excel files. Whenever you do anything with the file you have to go through COM interop, which is extremely slow, as you have to pass memory across two different processes.
Microsoft does not recommend automating Excel for this kind of work, and it provides the OpenXML library for reading .xlsx files without Excel.
I suggest you use a wrapper library for OpenXML, since the API is pretty hairy. You can check out this SO question for how to use it correctly: "open xml reading from excel file"

You're accessing the Excel file through Excel interop. By reading cell by cell you're making a lot of cross-process COM calls, which does not perform well.
You can read data in ranges instead of cell by cell. This loads the data into memory, where you can iterate over it much faster. (E.g. try to load column by column.)
BTW: You could instead use a library like http://epplus.codeplex.com, which reads Excel files directly.

Excel Data Reader
Lightweight and very fast if reading is your only concern.
