Situation I need to solve:
My client has some extremely large .xlsx files that resemble a database table (each row is a record, cols are fields)
I need to help them process those files (search, filter, etc).
By large I mean the smallest of them has 1 million records.
What I have tried:
SheetJS, and NPOI: both libs only reply with a simple "file too large".
EPPlus: can read files up to some hundred K records, but when faced with actual file it just give me a System.OverflowException, my guess is that it's basically out of memory, because a 200MB xlsx file already took me 4GB of memory to read.
I didn't try Microsoft OleDB, but I'd rather avoid it, since I don't want to purchase Microsoft Office just for a job.
Due to confidentiality I cannot share the actual file, but you can easily create a similar structure with 60 cols (first name, last name, dob, etc), and about 1M records.
The question would be solved as soon as you can read an .xlsx file with that criteria, remove half of the records then write to another place without facing memory issue.
Time is not too much of an issue. User is willing to wait an hour or 2 for result if needed.
Memory seem to be the issue currently. This is a personal request, and the client's machine is a laptop capped at 8GB RAM.
csv is not an option here. My client has .xlsx input and need .xlsx output.
Language choice is preferably JS, C# for Python, since I already know how to create executable with them (well can't tell an accountant to learn terminal, can we?).
It would be great if there is a way to slowly read small chunks of data from the file row-by-row, but solutions I have found only read the entire file at the same time.
For reading Excel file I would recommend ExcelDataReader. It does very fine with reading large files. I personally tried 500k-1M:
using (var stream = File.Open("C:\\temp\\input.xlsx", FileMode.Open, FileAccess.Read))
{
using (var reader = ExcelReaderFactory.CreateReader(stream))
{
while (reader.Read())
{
for (var i = 0; i < reader.FieldCount; i++)
{
var value = reader.GetValue(i)?.ToString();
}
}
}
}
Writing data back in the same efficient way is more tricky. I finished up with creating my own SwiftExcel library that is extremely fast and efficient (there is a performance chart comparing to other Nuget libraries including EPPlus) as it does not use any XML-serialization and writes data directly to the file:
using (var ew = new ExcelWriter("C:\\temp\\test.xlsx"))
{
for (var row = 1; row <= 100; row++)
{
for (var col = 1; col <= 10; col++)
{
ew.Write($"row:{row}-col:{col}", col, row);
}
}
}
I am using the EPPlus library to create an Excel workbook with many worksheets. I was wondering if it is safe to build the worksheets in parallel. I could not find a mention in the (limited) documentation if the library supports this kind of behavior:
package = new ExcelPackage();
int start = 1;
int end = 100;
Parallel.For(start, end; s =>
{
var worksheet = package.Workbook.Worksheets.Add("Worksheet" + s.ToString());
//routine to populate data here
});
Take a short look for the source code: https://github.com/JanKallman/EPPlus/blob/master/EPPlus/ExcelWorksheets.cs
As you can see, the Add method calls the AddSheet method that uses lock(...) to make the operation thread-safe, so yes, you can use the Parallel.For.
I'm reading an XLSX (Microsoft Excel XML file) using the Excel Data Reader from http://exceldatareader.codeplex.com/ and I getting some unexpected results.
The following code outputs data from multiple tabs
var reader = Excel.ExcelReaderFactory.CreateOpenXmlReader(uploadFile.InputStream);
while (reader.Read())
{
System.Diagnostics.Debug.WriteLine(reader.FieldCount );
for (int i = 0; i < reader.FieldCount; i++)
{
System.Diagnostics.Debug.Write(reader[i] + "*");
}
System.Diagnostics.Debug.WriteLine("\n~\n");
}
On a single line, I can get data from 3 or more tabs.
I would expect this to loop through and show all of the contents of the first tab and only the first tab.
What am I missing?
Update: It appears that the above code does work fine if there is only 1 tab in the excel file. This may just be a bug with this library. Has anyone else used this library to parse excel files with multiple tabs?
Thanks
OK, so my reply is extremely late with reference to this question, but if its any help try encapsulating your code in a reader.NextResult() block. This works the same way as when you parse through multiple DataTable objects within a DataSet.
Additionally, this approach has a very small memory footprint as opposed to the reader.AsDataSet() method, which hogs a lot of memory even for workbooks as small as 20MBs
eg
var reader = Excel.ExcelReaderFactory.CreateOpenXmlReader(uploadFile.InputStream);
do
{
while (reader.Read())
{
System.Diagnostics.Debug.WriteLine(reader.FieldCount );
for (int i = 0; i < reader.FieldCount; i++)
{
System.Diagnostics.Debug.Write(reader[i] + "*");
}
System.Diagnostics.Debug.WriteLine("\n~\n");
}
}while(reader.NextResult());
Which is why I am using NPOI. I have tried several other Excel readers, this one actually worked for me.
I have some tabular data that I'd like to turn into an Excel table.
Software available:
.NET 4 (C#)
Excel 2010 (using the Excel API is OK)
I prefer not to use any 3rd party libraries
Information about the data:
A couple million rows
5 columns, all strings (very simple and regular table structure)
In my script I'm currently using a nested List data structure but I can change that
Performance of the script is not critical
Searching online gives many results, and I'm confused whether I should use OleDb, ADO RecordSets, or something else. Some of these technologies seem like overkill for my scenario, and some seem like they might be obsolete.
What is the very simplest way to do this?
Edit: this is a one-time script I intend to run from my attended desktop.
Avoid using COM interop at all costs. Use a third-party API. Really. In fact, if you're doing this server-side, you virtually have to. There are plenty of free options. I highly recommend using EPPlus, but there are also enterprise-level solutions available. I've used EPPlus a fair amount, and it works great. Unlike interop, it allows you to generate Excel files without requiring Excel to be installed on the machine, which means you also don't have to worry about COM objects sticking around as background processes. Even with proper object disposal, the Excel processes don't always end.
http://epplus.codeplex.com/releases/view/42439
I know you said you want to avoid third-party libraries, but they really are the way to go. Microsoft does not recommend automating Office. It's really not meant to be automated anyway.
http://support.microsoft.com/kb/257757
However, you may want to reconsider inserting "a couple million rows" into a single spreadsheet.
Honoring your request to avoid 3rd party tools and using COM objects, here's how I'd do it.
Add reference to project: Com object
Microsoft Excel 11.0.
Top of module add:
using Microsoft.Office.Interop.Excel;
Add event logic like this:
private void DoThatExcelThing()
{
ApplicationClass myExcel;
try
{
myExcel = GetObject(,"Excel.Application")
}
catch (Exception ex)
{
myExcel = New ApplicationClass()
}
myExcel.Visible = true;
Workbook wb1 = myExcel.Workbooks.Add("");
Worksheet ws1 = (Worksheet)wb1.Worksheets[1];
//Read the connection string from App.Config
string strConn = System.Configuration.ConfigurationManager.ConnectionStrings["NewConnString"].ConnectionString;
//Open a connection to the database
SqlConnection myConn = new SqlConnection();
myConn.ConnectionString = strConn;
myConn.Open();
//Establish the query
SqlCommand myCmd = new SqlCommand("select * from employees", myConn);
SqlDataReader myRdr = myCmd.ExecuteReader();
//Read the data and put into the spreadsheet.
int j = 3;
while (myRdr.Read())
{
for (int i=0 ; i < myRdr.FieldCount; i++)
{
ws1.Cells[j, i+1] = myRdr[i].ToString();
}
j++;
}
//Populate the column names
for (int i = 0; i < myRdr.FieldCount ; i++)
{
ws1.Cells[2, i+1] = myRdr.GetName(i);
}
myRdr.Close();
myConn.Close();
//Add some formatting
Range rng1 = ws1.get_Range("A1", "H1");
rng1.Font.Bold = true;
rng1.Font.ColorIndex = 3;
rng1.HorizontalAlignment = XlHAlign.xlHAlignCenter;
Range rng2 = ws1.get_Range("A2", "H50");
rng2.WrapText = false;
rng2.EntireColumn.AutoFit();
//Add a header row
ws1.get_Range("A1", "H1").EntireRow.Insert(XlInsertShiftDirection.xlShiftDown, Missing.Value);
ws1.Cells[1, 1] = "Employee Contact List";
Range rng3 = ws1.get_Range("A1", "H1");
rng3.Merge(Missing.Value);
rng3.Font.Size = 16;
rng3.Font.ColorIndex = 3;
rng3.Font.Underline = true;
rng3.Font.Bold = true;
rng3.VerticalAlignment = XlVAlign.xlVAlignCenter;
//Save and close
string strFileName = String.Format("Employees{0}.xlsx", DateTime.Now.ToString("HHmmss"));
System.IO.File.Delete(strFileName);
wb1.SaveAs(strFileName, XlFileFormat.xlWorkbookDefault, Missing.Value, Missing.Value, Missing.Value, Missing.Value,
XlSaveAsAccessMode.xlExclusive, Missing.Value, false, Missing.Value, Missing.Value, Missing.Value);
myExcel.Quit();
}
Some things for your consideration...
If this is a client side solution, there is nothing wrong with using Interops.
If this is a server side solution, Don't use Interops. Good alternative is OpenXML SDK from Microsoft if you don't want 3rd party solution. It's free. I believe the latest one has similar object model that Excel has. It's a lot faster, A LOT, in generating the workbook vs going the interops way which can bog down your server.
I once read that the easiest way to create an Excel table was to actualy write a HTML table, including its structure and data, and simply name the file .xls.
Excel will be able to convert it, but it will display a warning saying that the content does not match the extension.
I agree that a 3rd party dll would be cleaner than the com, but if you go the interop route...
Hands down the best way to populate an excel sheet is to first put the data in a 2 dimensional string array, then get an excel range object with the same dimensions and set it (range.set_value2(oarray) I think). Using any other method is hideously slow.
Also be sure you use the appropriate cleanup code in your finally block.
i implemented "export to Excel" with the ms-access-ole-db-driver that can also read and write excel files the follwoing way:
preparation (done once)
create an excel file that contains all (header, Formatting, formulas, diagrams) with an empty data area as a template to be filled
give the data area (including the headers) a name (ie "MyData")
Implementing export
copy template file to destination folder
open an oledb-database connection to the destination file
use sql to insert data
Example
Excel table with Named area "MyData"
Name, FamilyName, Birthday
open System.Data.OleDb.OleDbConnection
execute sql "Insert into MyData(Name, FamilyName, Birthday) values(...)"
I used this connection string
private const string FORMAT_EXCEL_CONNECT =
// #"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties=""Excel 8.0;HDR={1}""";
#"Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=""Excel 12.0;HDR={1}""";
private static string GetExcelConnectionString(string excelFilePath, bool header)
{
return string.Format(FORMAT_EXCEL_CONNECT,
excelFilePath,
(header) ? "Yes" : "No"
);
}
I want to read excel file but in this way is too slow. What pattern should I use to read excel file faster. Should I try csv ?
I am using the following code:
ApplicationClass excelApp = excelApp = new ApplicationClass();
Workbook myWorkBook = excelApp.Workbooks.Open(#"C:\Users\OWNER\Desktop\Employees.xlsx");
Worksheet mySheet = (Worksheet)myWorkBook.Sheets["Sheet1"];
for (int row = 1; row <= mySheet.UsedRange.Rows.Count; row++)
{
for (int col = 1; col <= mySheet.UsedRange.Columns.Count; col++)
{
Range dataRange = (Range)mySheet.Cells[row, col];
Console.Write(String.Format(dataRange.Value2.ToString() + " "));
}
Console.WriteLine();
}
excelApp.Quit();
The reason your program is slow is because you are using Excel to open your Excel files. Whenever you are doing anything with the file you have to do a COM+ interop, which is extremely slow, as you have to pass memory across two different processes.
Microsoft has dropped support for reading .xlsx files using Excel interop. They released the OpenXML library specifically for this reason.
I suggest you use a wrapper library for using OpenXML, since the API is pretty hairy. You can check out this SO for how to use it correctly.
open xml reading from excel file
You're accessing Excel file through excel interop. By doing reads cell by cell you're doing a lot of P/Invoke's which is not very performant.
You can read data in ranges, not cell by cell. This loads the data into memory and you could iterate it much faster. (Eg. try to load column by column.)
BTW: You could use some library instead like http://epplus.codeplex.com which reads excel files directly.
Excel Data Reader
Lightweight and very fast if reading is your only concern.