Context
I have been using EPPlus to automate Excel report generation, with C# as the client language of the library.
Problem:
After trying to write a really big report (the result of a SQL query) with pivot tables, charts and so forth, I end up with an OutOfMemoryException.
Troubleshooting
To troubleshoot, I opened an existing 138 MB report and used the GC class to take a peek at what is happening with my memory. Here are the results.
ExcelPackage pkg = new ExcelPackage(new FileInfo(@"PATH TO THE REPORT.xlsx"));
ExcelWorkbook wb = pkg.Workbook;
Garbage Collection Results, before the second line of code, and after.
So I have no idea what to do from here. All I am doing is opening the report, and that alone consumes roughly 10 times (9.98, actually) the size of the report in memory.
The ~138 MB Excel file takes up 1,370,817,264 bytes of RAM.
Update One:
There's a fairly recent beta version of EPPlus whose changelog lists:
New Cell store
* Less memory consumption
* Insert columns (not on the range level)
* Faster row inserts
After updating the NuGet package, I still get the same exception, but it is now thrown on the first line instead of the second.
Modern Excel files, i.e. .xlsx files, are zip-compressed and often compress down to about 10% of their uncompressed size. I just uncompressed a 1.6 MB file I generated using a similar tool and found it extracted to 18.8 MB of data.
You've got a 0.138 GB file that is using 1.370 GB of memory, so the file on disk is almost exactly 10% of its in-memory size. The uncompressed representation in memory is what is eating your memory.
If you're curious, you can use a tool like 7-Zip to extract the .xlsx file, or you can rename the file to end in .zip and browse it in Windows.
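The same check can be done from code. The following is a minimal sketch that uses only System.IO.Compression from the base class library; the class and method names (XlsxSizeReport, Measure, Print) are made up for illustration. It totals the compressed and uncompressed sizes of all parts inside the package. Note that it measures the raw XML only, and a library's in-memory object model may well be larger or smaller than that.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class XlsxSizeReport
{
    // Returns the total compressed and uncompressed byte counts
    // of every part (zip entry) in the .xlsx package.
    public static (long Compressed, long Uncompressed) Measure(Stream xlsx)
    {
        // leaveOpen: true, so the caller keeps ownership of the stream.
        using (var zip = new ZipArchive(xlsx, ZipArchiveMode.Read, leaveOpen: true))
        {
            long compressed = zip.Entries.Sum(e => e.CompressedLength);
            long uncompressed = zip.Entries.Sum(e => e.Length);
            return (compressed, uncompressed);
        }
    }

    // Convenience wrapper: prints both totals for a file on disk.
    public static void Print(string path)
    {
        using (var file = File.OpenRead(path))
        {
            var (compressed, uncompressed) = Measure(file);
            Console.WriteLine(
                $"{path}: {compressed:N0} bytes compressed, {uncompressed:N0} bytes uncompressed");
        }
    }
}
```

Calling XlsxSizeReport.Print with the path of the report shows how much XML the library has to parse once the archive is unpacked.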
As I've encountered this too and found no real solution, I had to come up with one myself.
It comes as a new library: https://github.com/danielgindi/SpreadsheetStreams.net
It is based on a very old piece of code of mine that supported CSV and XML; I refactored the interface, added XLSX support, and published it as a standalone library.
This is not a replacement for EPPlus or other spreadsheet manipulation libraries; it is only about streaming generation of reports. Not all Excel features are supported, either.
Related
No, ADO.NET will not solve my problem, because the Excel files I'm working with do not contain information in tabular form. In other words, there is nothing to query, and the names and number of sheets will vary.
Essentially, my job is to search every single cell in an Excel document and validate it against some other data.
Right now all I have is a byte[] array that represents the contents of an .xls file. Converting to a string is meaningless since it's just binary data.
If I use COM interop and run Excel in the background, is it possible to inject it with binary data in byte[] array form or do I have to save the file to disk and then automate the process of opening it and scanning each row?
Isn't there an easier way to do it?
How do you read the binary data of an excel file (.xls) using .NET
There are a number of ways. The Excel file format has changed a few times, so reading the files natively is hard work and version dependent; it's usually not recommended. For reading tabular data most people choose ADO.NET, but as you allude, if you need any formatting or discovery then Microsoft would recommend COM Interop.
If I use COM interop and run Excel in the background, is it possible to inject it with binary data in byte[] array form
The Excel COM object model does allow you to set data on a Range object in bulk: you assign it a two-dimensional object array (object[,]).
or do I have to save the file to disk and then automate the process of opening it and scanning each row?
No, you can interact with the "out of process" COM server (Excel) without having to save first; you can set your data, format it, etc., in memory.
Isn't there an easier way to do it?
Yes, there is: check out SpreadsheetGear. Its object model is nearly identical to the COM model, but you do not need Excel involved at all, and it is an order of magnitude faster when working with large data. It's not cheap ($1000 last time I checked) but will save you far more than that in coding effort. (I am not affiliated with SpreadsheetGear in any way.)
You could use NPOI to open and read your XLS files. You'll basically want to loop through your sheets, rows and columns looking for data. I commonly use NPOI to read and write XLS forms that contain data in random cells throughout a worksheet.
I need to read XLSX files and extract as much content from them as possible. Which of the APIs should I use?
OLE DB, Open XML SDK, or Excel Interop?
Which is the easiest to use?
Can you retrieve all the information using one or the other? E.g. dates, times, merged cells, tables, pivot tables, etc.
You can try all of them and choose the one that fits you most...
Depending on the data you want to read, I'd suggest you use Open XML over Interop or OLE DB.
I don't know the Open XML SDK, but I have some experience with the EPPlus library, which I use a lot, and I can only say good things about it: it's fast, easy to learn, and comes with good examples. The library is based on the Office Open XML format, so I suppose it's pretty much the same as the SDK you've mentioned, and it can easily read and write Excel 2007 and 2010 files.
On the linked site you'll find the library itself, documentation and some example "Hello World" projects to download.
Why that library in the first place? Because with it you will be able to read not only cell values, but also their colors, fonts, widths and heights, merging and all that detailed stuff, which you can not only read but modify as well. What's more, you don't need Excel installed to do that.
In second place, just in case you need to extract tabular data from a worksheet, you may play with OLE DB. I'm afraid that with it you won't be able to extract any information about formats, colors etc., and the data must be laid out as a table in the worksheet, so that you can treat it as a database table.
The last one is Interop, because:
- it's a COM library, so you need to be very careful when playing with it via .NET, as it's easy to cause some ugly and hard-to-find memory leaks (confirmed by my own bad experience); if you don't dispose of its objects properly, it leaves the Excel.exe process open,
- it's much slower than the previous methods,
- basically, it adds almost no value over the previous methods (EPPlus or OLE DB) and requires Excel to be installed on the client's machine, so why use it?
Good luck, then.
I have code that uses the OpenXML library to export data.
I have 20,000 rows and 22 columns and it takes ages (about 10 minutes).
Is there any solution that would export data from C# to Excel faster? I am doing this from an ASP.NET MVC app, and many people's browsers are timing out.
Assuming 20,000 rows and 22 columns with about 100 bytes per cell, that is 20,000 × 22 × 100 = 44,000,000 bytes (about 42 MiB) of data alone. Add XML tags and formatting, and I'd say you end up zipping (an .xlsx is nothing but several zipped XML files) around 100 MB of data.
Of course this takes a while, and so does fetching the data.
I recommend you use EPPlus (Excel Package Plus) instead of the Office Open XML SDK.
http://epplus.codeplex.com/
There's probably a bug/performance-issue in the write-in-a-hurry-and-hope-that-it-doesnt-blow-up-too-soon Microsoft code.
CSV. It is a plain text file, but can be opened by any version of Excel.
No doubt it is an easier way to export data to Excel. A lot of websites provide data export as CSV.
All you need to do is separate the values with commas (,) and the records with line breaks. Building the CSV file takes almost no extra resources, so it is quite fast.
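One caveat: values that themselves contain a comma, a double quote or a line break must be wrapped in double quotes, with embedded quotes doubled, or Excel will split them incorrectly. A minimal sketch of such a writer follows; the class and method names (CsvExport, Escape, Write) are made up for illustration, and the quoting rules are the common RFC 4180 conventions rather than anything from the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CsvExport
{
    // Quotes a field only when it contains a comma, a double quote
    // or a line break; embedded double quotes are doubled.
    public static string Escape(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Writes one record per line, sending each row to the writer
    // as soon as it is enumerated.
    public static void Write(TextWriter writer, IEnumerable<IEnumerable<string>> rows)
    {
        foreach (var row in rows)
        {
            writer.Write(string.Join(",", row.Select(Escape)));
            writer.Write("\r\n");
        }
    }
}
```

Because Write takes any TextWriter, the same code can target a StreamWriter over a file or over the HTTP response stream.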
I wound up using an open source solution called ClosedXML that worked great.
Depending on what version of Excel you are targeting, you could expose the data as an OData service, which Excel 2010 can consume natively and which will handle the downloading and formatting for you.
I am assuming that this data is something that needs to be completely sent to the client and has already been pre-filtered in some fashion, but still needs to be sent back to the person who made the request.
In this case, you want to perform this particular operation 'asynchronously'. I'm not sure if this would fit your workflow, but say that a person requests this large XML formatted document, I would: a) queue another worker thread to kick off the generation of the document while returning a 'token' (perhaps a GUID) to the requester; b) return a link to a page where the requester can click on the link (passing the token), allowing the page to look up results.
If the thread has completed processing the document, it places it into a special folder with a unique name and adds the token to a database table with its document location. If the person requests that page, the token exists in the database and the document exists on the file system, they are allowed to click and download it through HTTP. If it does not exist, they are either told it does not exist or to wait for the results. (This message can be based on the time the request was received.)
If the person downloads the document successfully (and you can do this through script), you can remove the entry for the database for the document with that token and delete the file from the file system.
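The token workflow described above could be sketched roughly as follows. This is a simplified illustration, not a production design: it keeps the token table in an in-memory dictionary instead of the database table described above, and the names (ReportJobs, Start, TryGetResult, Complete) are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class ReportJobs
{
    // token -> background task that yields the path of the generated file
    private readonly ConcurrentDictionary<Guid, Task<string>> _jobs =
        new ConcurrentDictionary<Guid, Task<string>>();

    // Starts generation in the background and immediately returns a token.
    public Guid Start(Func<string> generateReportFile)
    {
        var token = Guid.NewGuid();
        _jobs[token] = Task.Run(generateReportFile);
        return token;
    }

    // True (with the file path) once the report is ready;
    // false while it is still running, has failed, or the token is unknown.
    public bool TryGetResult(Guid token, out string path)
    {
        path = null;
        if (_jobs.TryGetValue(token, out var job) &&
            job.Status == TaskStatus.RanToCompletion)
        {
            path = job.Result;
            return true;
        }
        return false;
    }

    // Forgets the token after a successful download;
    // deleting the file itself is left to the caller.
    public void Complete(Guid token)
    {
        _jobs.TryRemove(token, out _);
    }
}
```

The request handler would call Start and return the token right away; the download page would call TryGetResult and either serve the file or ask the user to wait. An in-memory table is lost when the application restarts, which is one reason the answer above suggests a database table and a folder on disk.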
I hope I read this question correctly.
I have found that I can speed up exporting data from a database into an Excel spreadsheet by limiting the number of export operations. I found that by accumulating 100 lines of data before writing, the creation speed increased by a factor of at least 5-10x.
The most common mistake when exporting data is in the workflow:
Build Model
Build XML DOM
Save XML DOM to file
This workflow adds overhead: building the XML DOM takes time, the XML DOM is kept in memory together with the model, and only then is the whole bunch of data written to a file.
A better way to handle this is to convert your model entry by entry directly to the target format and write it directly to a (buffered) file.
A format with low overhead that's fast to write and is readable by Excel is CSV (ok, it's legacy, it's awkward...).
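As an illustration of the entry-by-entry approach, here is a minimal sketch that writes rows with XmlWriter straight to a buffered stream, so that no DOM is ever built. The element names (rows, row, c) and the class name StreamingExport are placeholders chosen for this example, not a real spreadsheet schema.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class StreamingExport
{
    // Writes each row to the output as soon as it is enumerated.
    // Note: disposing the BufferedStream also closes the output stream.
    public static void WriteRows(Stream output, IEnumerable<string[]> rows)
    {
        var settings = new XmlWriterSettings { Indent = false };
        using (var buffered = new BufferedStream(output, 64 * 1024))
        using (var xml = XmlWriter.Create(buffered, settings))
        {
            xml.WriteStartElement("rows");
            foreach (var row in rows)
            {
                xml.WriteStartElement("row");
                foreach (var cell in row)
                {
                    // WriteElementString escapes special characters itself.
                    xml.WriteElementString("c", cell);
                }
                xml.WriteEndElement(); // </row>
            }
            xml.WriteEndElement(); // </rows>
        }
    }
}
```

If the rows come from a lazily evaluated source (for example a data reader wrapped in an iterator), only one row needs to be in memory at a time.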
I have huge Excel files that I have to open from a web browser. It takes several minutes to load a huge file. Is it possible to open a single worksheet (single tab) at a time from an Excel file that contains many worksheets? I have to do this using C# / ASP.NET MVC.
I'm assuming you have the Excel workbook on the server and just want to send a single worksheet to the client. Does the user then edit the worksheet? Will they be uploading it back?
Assuming this is just a report, why not use the Open XML SDK to read the workbook, extract the sheet in question and send it back to the client? This is what @Jim in the comments was suggesting. You can get the SDK here: Open XML SDK 2.0 for Microsoft Office. However, I'm not sure if it will work with the 'old' Excel format. I assume you'll need to save the template workbook in the new Office format (xlsx).
Your question is slightly unclear as to where the spreadsheet is stored.
If it's on a server you control, process it, extracting the sheets you need, and create other workbooks which are smaller in size. (Or possibly save them in a different format.)
If it's not on a server you control, download the file using C#, then go through a similar process of extracting the sheet before opening it.
Having said that, I've dealt with some largish spreadsheets (20MB or so), and haven't really had a problem processing the entire spreadsheet as a whole.
So where is the bottleneck? Your network or possibly the machine you're running?
Use third party components.
We fought with server-side Excel generation for years and were defeated.
We bought third-party components and all the problems were gone.
From your question, it seems you want to improve load time by opening the data from one worksheet instead of the whole workbook. If this is the case and you only want the data, then access the workbook using ADO.NET with the OLEDB provider. (You can use threading to load each worksheet to improve load performance. For instance, loading three large data sets in three worksheets took 17 seconds. Loading each worksheet on a separate thread loaded the same data sets in 5 seconds.)
From experience, performance starts to really suffer with workbooks of 40 MB or more, especially if the workbooks contain many formulas. My largest workbook of 120 MB takes several minutes to load. Using OLEDB access, I can load, access, and process the same data in a few seconds.
If you want the client to open the data in Excel, gather the data via ADO.NET/OLEDB, get XML and transform it into XMLSS using XSLT. This is easy, and there is plenty of documentation and samples.
If you just want to present the data, gather the data via ADO.NET/OLEDB, get XML and transform it into HTML using XSLT. This is also easy and well documented.
Be aware that the browser and computer become non-responsive with large data sets. I had to set an upper limit. If the limit was reached, I notified the user of truncated results; otherwise, the user thought the computer was "locked".
Take a look at this question in StackOverflow:
Create Excel (.XLS and .XLSX) file from C#
I think you can open your workbook on the server (inside your ASP.NET MVC application) and process only the specific worksheet you want. You can then send such worksheet to the user using NPOI.
The following post shows you how to do that using an ASP.NET MVC application:
Creating Excel spreadsheets .XLS and .XLSX in C#
You can't "say" to Excel, even via Interop, that you only want a single worksheet. There are a lot of reasons, like formulas, references and links between worksheets, which make the task impossible.
If you only want to read the data from the worksheet, maybe the OLEDB Data Provider is the best option for you. Here is a full example: Reading excel file using OLEDB Data Provider
Otherwise, you will need to load the entire workbook in memory before doing anything with it.
How can I 'read' an Excel 2003 document stored as a SharePoint SPFile? I can retrieve the document from the library with no problems using SPFile.OpenBinary() and then putting that into a MemoryStream.
The original idea was to use OpenXML to interrogate the document (which will take this object type as a constructor), but the Excel version (2003) prohibits this.
Just to cloud the issue further, there is no guarantee that I will have any Excel version on the host machine, so possibly won't be able to use the interop assemblies either.
Suggestions or solutions will be gratefully received.
When I say read, I mean pull data from named ranges, cell references etc. All of the open source libraries I have found (ExcelDataReader, NPOI, OpenXML) have some limitation or another that prohibits their use, e.g. they can't load macro-enabled sheets.
The Excel document is loaded into a SharePoint library which exposes this list as a collection of SPFile(s). These files can be read into a MemoryStream simply enough, but most of the libraries I have tried require a FileStream constructor, which means writing to the filesystem on the application server.
I've not tried SpreadsheetGear, but if there's no footprint on the filesystem, then I'll take a look for sure, but this is not an option on this project. I'll update this thread with my findings...
I'm reduced to using the PIAs. Dirty, dirty, dirty.
SpreadsheetGear for .NET can open xls and xlsx workbooks from a memory stream with SpreadsheetGear.Factory.GetWorkbookSet().Workbooks.OpenFromStream(System.IO.Stream) and also has the ability to open directly from a byte array with OpenFromMemory(byte[]). Once opened, SpreadsheetGear has a comprehensive API, calculation engine, rendering engine and more.
You can see live samples here and download the free trial here.
Disclaimer: I own SpreadsheetGear LLC
I've found this library (http://exceldatareader.codeplex.com/) on CodePlex which seems to be able to read any Excel version. There might be a lot more on the web.
When you say read, what exactly do you mean? There seems to be some great debate amongst developers as to what the term's definition is. Either way, it shouldn't really matter whether Excel is on their system or not, since I am pretty sure that a person who wanted to view the file would need at the very least a reader anyway. That being said, I believe your fear is a moot point and that using a MemoryStream should suffice.