How to process extremely large .xlsx files with C#

Situation I need to solve:
My client has some extremely large .xlsx files that resemble a database table (each row is a record, columns are fields).
I need to help them process those files (search, filter, etc).
By large I mean the smallest of them has 1 million records.
What I have tried:
SheetJS and NPOI: both libraries simply reply with "file too large".
EPPlus: can read files up to a few hundred thousand records, but when faced with the actual file it just gives me a System.OverflowException; my guess is that it is effectively out of memory, because a 200 MB xlsx file already took 4 GB of memory to read.
I didn't try Microsoft OleDB, but I'd rather avoid it, since I don't want to purchase Microsoft Office just for a job.
Due to confidentiality I cannot share the actual file, but you can easily create a similar structure with 60 cols (first name, last name, dob, etc), and about 1M records.
The question would be solved as soon as you can read an .xlsx file meeting those criteria, remove half of the records and then write the rest to another file without running into memory issues.
Time is not too much of an issue. The user is willing to wait an hour or two for the result if needed.
Memory seems to be the issue currently. This is a personal request, and the client's machine is a laptop capped at 8 GB of RAM.
CSV is not an option here. My client has .xlsx input and needs .xlsx output.
Language choice is preferably JS, C# or Python, since I already know how to create executables with them (we can't tell an accountant to learn the terminal, can we?).
It would be great if there were a way to read small chunks of data from the file row by row, but the solutions I have found only read the entire file at once.

For reading the Excel file I would recommend ExcelDataReader. It handles large files very well; I have personally tried 500k-1M rows:
using (var stream = File.Open("C:\\temp\\input.xlsx", FileMode.Open, FileAccess.Read))
{
    using (var reader = ExcelReaderFactory.CreateReader(stream))
    {
        while (reader.Read())
        {
            for (var i = 0; i < reader.FieldCount; i++)
            {
                var value = reader.GetValue(i)?.ToString();
            }
        }
    }
}
Writing data back in an equally efficient way is trickier. I ended up creating my own SwiftExcel library, which is extremely fast and efficient (there is a performance chart comparing it to other NuGet libraries, including EPPlus) because it skips XML serialization and writes data directly to the file:
using (var ew = new ExcelWriter("C:\\temp\\test.xlsx"))
{
    for (var row = 1; row <= 100; row++)
    {
        for (var col = 1; col <= 10; col++)
        {
            ew.Write($"row:{row}-col:{col}", col, row);
        }
    }
}
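Combining the two, here is a minimal sketch of the original task (read the source file, drop records, write the rest straight to a new file) so that only one row is held in memory at a time. The file paths and the keep-every-second-record filter are placeholder assumptions; replace the filter with your real predicate:

using System.IO;
using ExcelDataReader;
using SwiftExcel;

// Stream rows from the source file and write the kept ones directly to the
// output file; neither library needs to hold the whole workbook in memory.
// Note: on .NET Core, ExcelDataReader may also require registering the
// System.Text.Encoding.CodePages provider first (see its documentation).
using (var stream = File.Open("C:\\temp\\input.xlsx", FileMode.Open, FileAccess.Read))
using (var reader = ExcelReaderFactory.CreateReader(stream))
using (var ew = new ExcelWriter("C:\\temp\\output.xlsx"))
{
    var sourceRow = 0;
    var targetRow = 1;
    while (reader.Read())
    {
        sourceRow++;
        // Placeholder filter: keep the header row plus every second record.
        if (sourceRow > 1 && sourceRow % 2 == 0) continue;

        for (var col = 0; col < reader.FieldCount; col++)
        {
            ew.Write(reader.GetValue(col)?.ToString(), col + 1, targetRow);
        }
        targetRow++;
    }
}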

Related

How to create and have the user download an Excel file from variable data in arrays?

I'm working on a project in .NET Core 1.1, and now I have to let the user download an Excel file whose data depends on parameters chosen by the user, so the Excel file should be created at the moment the user clicks the "Export to Excel" button and then downloaded.
I've been searching the internet but honestly haven't found any clear answers. I guess I will have to use the Open XML SDK, but I don't know enough about creating the file in memory and so on.
To sum up: I have data in arrays, and when the user clicks a button I would like to create the Excel file in memory from that data and then download it in the user's browser.
Do you also want to show your data before download? Maybe you can use DataTables:
DataTables is a plug-in for the jQuery JavaScript library. It is a highly flexible tool, built upon the foundations of progressive enhancement, that adds advanced features to any HTML table, such as:
Pagination: previous, next and page navigation.
Instant search: filter results by text search.
Using NPOI (https://github.com/tonyqus/npoi), something along the lines of this:
void SafeArrayAsExcel(object[,] rows, string fileName)
{
    var workbook = new HSSFWorkbook();
    var sheet = workbook.CreateSheet("New Sheet");
    for (int i = 0; i < rows.GetLength(0); i++)
    {
        var row = sheet.CreateRow(i);
        for (int j = 0; j < rows.GetLength(1); j++)
            row.CreateCell(j).SetCellValue(rows[i, j]?.ToString());
    }
    using (var fileOut = new FileStream(fileName, FileMode.Create))
    {
        workbook.Write(fileOut);
    }
}
You can of course use a MemoryStream instead of a FileStream and do whatever you want with the generated Excel file.
Use HSSFWorkbook for the xls format and XSSFWorkbook for the xlsx format.
I don't know all the compatibility details between .NET Core 1.1 and .NET Standard 2.0, but there should be a way to get it to work.
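If you do go the MemoryStream route in an ASP.NET Core controller, a rough sketch could look like the one below. It assumes NPOI's XSSFWorkbook (xlsx) and uses a small hard-coded array in place of your real data; also check how your NPOI version handles closing the stream in Write:

using System.IO;
using Microsoft.AspNetCore.Mvc;
using NPOI.XSSF.UserModel;

public class ExportController : Controller
{
    public IActionResult ExportToExcel()
    {
        // Placeholder data; in your case this comes from the arrays you already have.
        string[,] data = { { "First name", "Last name" }, { "Ada", "Lovelace" } };

        var workbook = new XSSFWorkbook();
        var sheet = workbook.CreateSheet("Export");
        for (int i = 0; i < data.GetLength(0); i++)
        {
            var row = sheet.CreateRow(i);
            for (int j = 0; j < data.GetLength(1); j++)
                row.CreateCell(j).SetCellValue(data[i, j]);
        }

        using (var ms = new MemoryStream())
        {
            workbook.Write(ms);
            // MemoryStream.ToArray still works even if Write closed the stream.
            return File(ms.ToArray(),
                "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                "export.xlsx");
        }
    }
}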

Strange results from OpenReadAsync() when reading data from Azure Blob storage

I'm having a go at modifying an existing C# (dot net core) app that reads a type of binary file to use Azure Blob Storage.
I'm using WindowsAzure.Storage (8.6.0).
The issue is that this app reads the binary data from files from a Stream in very small blocks (e.g. 5000-6000 bytes). This reflects how the data is structured.
Example pseudo code:
var blocks = new List<byte[]>();
var numberOfBytesToRead = 6240;
var numberOfBlocksToRead = 1700;
using (var stream = await blob.OpenReadAsync())
{
    stream.Seek(3000, SeekOrigin.Begin); // start reading at a particular position
    for (int i = 1; i <= numberOfBlocksToRead; i++)
    {
        byte[] traceValues = new byte[numberOfBytesToRead];
        stream.Read(traceValues, 0, numberOfBytesToRead);
        blocks.Add(traceValues);
    }
}
If I try to read a 10 MB file using OpenReadAsync(), I get invalid/junk values in the byte arrays after around 4,190,000 bytes.
If I set StreamMinimumReadSize to 100 MB it works.
If I read more data per block (e.g. 1 MB) it works.
Some of the files can be more than 100 MB, so setting the StreamMinimumReadSize may not be the best solution.
What is going on here, and how can I fix this?
Are the invalid/junk values zeros? If so (and maybe even if not), check the return value of stream.Read. That method is not guaranteed to read the number of bytes you ask for; it can read less, in which case you are supposed to call it again in a loop until it has read the total amount you want. A quick web search will show plenty of examples of the necessary looping.
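As a sketch, a generic helper like the following keeps reading until the requested number of bytes has arrived (ReadFully is just an illustrative name, not part of the Azure SDK):

using System.IO;

// Keep calling Read until 'count' bytes have arrived or the stream ends;
// Stream.Read may legitimately return fewer bytes than requested.
static int ReadFully(Stream stream, byte[] buffer, int count)
{
    int totalRead = 0;
    while (totalRead < count)
    {
        int read = stream.Read(buffer, totalRead, count - totalRead);
        if (read == 0) break; // end of stream
        totalRead += read;
    }
    return totalRead;
}

In the pseudo code above, the single stream.Read call would then become ReadFully(stream, traceValues, numberOfBytesToRead).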

SQL Server, C# and iTextSharp. What's the best way to join PDFs?

I have a SQL Server DB. In there are many, many rows, and each row has a column that contains a stored PDF.
The DB is a gig in size, so we can expect roughly half of that to be due to the PDFs.
Now I have a requirement to join all those PDFs ... into one PDF. Don't ask why.
Can you suggest the best way forward and which component is best suited for this job? There are many answers available:
How can I join two PDF's using iTextSharp?
Merge memorystreams to one itext document
How to merge multiple pdf files (generated in run time)?
as to how to join two (or more) PDFs. But what I'm asking about is performance: we are literally dealing with around 50,000 PDFs that need to be merged into one almighty PDF.
[Edit: Solution] Brought the time to merge 1,000 PDFs down from 4m30s to 21s:
public void MergePDFs(string targetPDF, string sourceDir)
{
    using (FileStream stream = new FileStream(targetPDF, FileMode.Create))
    {
        var files = Directory.GetFiles(sourceDir);
        Document pdfDoc = new Document(PageSize.A4);
        PdfCopy pdf = new PdfCopy(pdfDoc, stream);
        pdfDoc.Open();
        Console.WriteLine("Merging files count: " + files.Length);
        int i = 1;
        var watch = System.Diagnostics.Stopwatch.StartNew();
        foreach (string file in files)
        {
            Console.WriteLine(i + ". Adding: " + file);
            pdf.AddDocument(new PdfReader(file));
            i++;
        }
        if (pdfDoc != null)
            pdfDoc.Close();
        watch.Stop();
        var elapsedMs = watch.ElapsedMilliseconds;
        MessageBox.Show(elapsedMs.ToString());
    }
}
I just did a C#/WinForms project with PDFsharp, merging images into PDFs, and it worked phenomenally with a traditional folder structure. I imagine it would work similarly with database-stored PDFs as long as you can pull them into a memory stream first and then merge them.
Some suggestions:
1) Do it in a multi-threaded environment so you can work on multiple PDFs at a time.
2) Open only what you need and close it as soon as the operation is complete. Say you have three documents that need to be merged into one: create a blank PDF, open the first into a memory stream, open the blank, append the first to the blank, close the first, save the blank and close the blank; repeat for the second and third. This way you control how much memory you take up at any one point in time. Using this approach I was able to append millions of images while keeping memory usage under control (see the sketch after this list).
3) Use using statements when working with these objects. This helps with memory cleanup and eliminates the need to call the garbage collector manually, which is frowned upon.
4) Separate your business (work) logic from your UI as best you can, so you can cancel the operation at any point in time or view the current status as it progresses.
5) Log everything that is done, so you can go back and correct one-offs for the PDFs that didn't make it through the first pass.
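A minimal sketch of the pattern from suggestion 2, using iTextSharp since that is what the question already uses; the pdfBlobs parameter stands in for the PDF byte arrays pulled from your DB rows:

using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

// Merge PDFs held as byte arrays, opening one reader at a time and closing it
// immediately so memory use stays bounded regardless of the total count.
static void MergePdfBlobs(IEnumerable<byte[]> pdfBlobs, string targetPath)
{
    using (var output = new FileStream(targetPath, FileMode.Create))
    {
        var document = new Document(PageSize.A4);
        var copy = new PdfCopy(document, output);
        document.Open();

        foreach (var pdfBytes in pdfBlobs)
        {
            var reader = new PdfReader(pdfBytes);
            copy.AddDocument(reader); // append all pages of this PDF
            reader.Close();           // release this PDF before moving on
        }

        document.Close(); // closes the PdfCopy as well
    }
}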

Interop Excel is slow

I am writing an application to open an Excel sheet and read it
MyApp = new Excel.Application();
MyBook = MyApp.Workbooks.Open(filename);
MySheet = (Excel.Worksheet)MyBook.Sheets[1]; // Explicit cast is not required here
lastRow = MySheet.Cells.SpecialCells(Excel.XlCellType.xlCellTypeLastCell).Row;
MyApp.Visible = false;
It takes about 6-7 seconds for this to take place; is this normal with Interop Excel?
Also, is there a quicker way to read an Excel sheet than this?
string[] xx = new string[lastRow];
for (int index = 1; index <= lastRow; index++)
{
    int maxCol = endCol - startCol;
    for (int j = 1; j <= maxCol; j++)
    {
        try
        {
            xx[index - 1] += (MySheet.Cells[index, j] as Excel.Range).Value2.ToString();
        }
        catch
        {
        }
        if (j != maxCol) xx[index - 1] += "|";
    }
}
MyApp.Quit();
System.Runtime.InteropServices.Marshal.ReleaseComObject(MySheet);
System.Runtime.InteropServices.Marshal.ReleaseComObject(MyBook);
System.Runtime.InteropServices.Marshal.ReleaseComObject(MyApp);
Appending to the answer of @RvdK: yes, COM interop is slow.
Why is it slow?
It is due to how it works. Every call made from .NET must be marshaled to a local COM proxy; from there it is marshaled from one process (your app) to the COM server (Excel), through IPC inside the Windows kernel. It then gets translated (dispatched) from the server's local proxy into native code, where the arguments are marshaled from OLE Automation-compatible types into native types, their validity is checked and the function is performed. The result of the function travels back roughly the same way, through several layers, between the two processes.
So each and every command is quite expensive to execute; the more of them you issue, the slower the whole process is. You can find lots of documentation all around the web, as COM is an old and well-established standard (slowly dying along with Visual Basic 6).
One example of such article is here: http://www.codeproject.com/Articles/990/Understanding-Classic-COM-Interoperability-With-NE
Is there a quicker way to read?
ClosedXML can both read and write Excel xlsx files (even formulas, formatting and such) using Microsoft's Open XML SDK; see here: https://closedxml.codeplex.com/wikipage?title=Finding%20and%20extracting%20the%20data&referringTitle=Documentation
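A rough sketch of reading a sheet with ClosedXML (the path and sheet index are placeholders, and this is not tuned for very large files):

using System;
using ClosedXML.Excel;

// Read every used cell of the first worksheet without going through COM interop.
using (var workbook = new XLWorkbook(@"C:\temp\input.xlsx"))
{
    var sheet = workbook.Worksheet(1);
    foreach (var row in sheet.RowsUsed())
    {
        foreach (var cell in row.CellsUsed())
        {
            Console.WriteLine(cell.GetString());
        }
    }
}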
Excel Data Reader claims to be able to read both legacy and new Excel data files; I did not try it myself. Take a look here: https://exceldatareader.codeplex.com/
Another way to read data faster is to use Excel automation to convert the sheet into a data file that you can easily understand and batch-process without the interop layer (e.g. XML, CSV). This answer shows how to do it.
Short answer: correct, interop is slow (I had the same problem; it took a couple of seconds to read 300 lines).
Use a library for this:
http://epplus.codeplex.com/
http://npoi.codeplex.com/
This answer is only about the second part of your question.
You are using lots of single-cell ranges there, which is not how this is intended to be used and is indeed very slow.
First read the complete range and then iterate over the result like so:
// Value2 on a multi-cell range returns an object[,] indexed from 1 in both dimensions.
var xx = (object[,])MySheet.Range["A1", "XX100"].Value2;
for (int i = 1; i <= xx.GetLength(0); i++)
{
    for (int j = 1; j <= xx.GetLength(1); j++)
    {
        Console.WriteLine(xx[i, j]?.ToString());
    }
}
This will be much faster!
You can use this free library; xls and xlsx are both supported:
Workbook wb = new Workbook();
wb.LoadFromFile(ofd.FileName);
https://freenetexcel.codeplex.com/

Unexpected results using Excel Data Reader

I'm reading an XLSX (Microsoft Excel XML file) using the Excel Data Reader from http://exceldatareader.codeplex.com/ and I am getting some unexpected results.
The following code outputs data from multiple tabs:
var reader = Excel.ExcelReaderFactory.CreateOpenXmlReader(uploadFile.InputStream);
while (reader.Read())
{
    System.Diagnostics.Debug.WriteLine(reader.FieldCount);
    for (int i = 0; i < reader.FieldCount; i++)
    {
        System.Diagnostics.Debug.Write(reader[i] + "*");
    }
    System.Diagnostics.Debug.WriteLine("\n~\n");
}
On a single line, I can get data from 3 or more tabs.
I would expect this to loop through and show all of the contents of the first tab and only the first tab.
What am I missing?
Update: It appears that the above code does work fine if there is only one tab in the Excel file. This may just be a bug in this library. Has anyone else used this library to parse Excel files with multiple tabs?
Thanks
OK, my reply is extremely late with reference to this question, but if it's any help, try wrapping your code in a loop on reader.NextResult(). This works the same way as parsing through multiple DataTable objects within a DataSet.
Additionally, this approach has a very small memory footprint, as opposed to the reader.AsDataSet() method, which hogs a lot of memory even for workbooks as small as 20 MB.
For example:
var reader = Excel.ExcelReaderFactory.CreateOpenXmlReader(uploadFile.InputStream);
do
{
    while (reader.Read())
    {
        System.Diagnostics.Debug.WriteLine(reader.FieldCount);
        for (int i = 0; i < reader.FieldCount; i++)
        {
            System.Diagnostics.Debug.Write(reader[i] + "*");
        }
        System.Diagnostics.Debug.WriteLine("\n~\n");
    }
} while (reader.NextResult());
That is why I am using NPOI. I have tried several other Excel readers; this one actually worked for me.
