IEnumerable<T> to Excel (2007) w/ Formatting - c#

I'm looking for a good way to export an IEnumerable to Excel 2007 (.xlsb).
T is a known type, so reflection is not strictly necessary, which matters for performance reasons.
I'm using .xlsb (excel binary format) because the amount of data will be large for Excel.
The IEnumerable in question has approximately 2 million records. It is retrieved from an Access database (.mdb), goes through some processing, and then LINQ queries are written to generate a report structure for T. These records do not all need to be sent to Excel at once (nor could they be); the data will be sub-divided by a condition, with the largest subdivision being roughly 1 million records.
I want to be able to convert the data to an Excel Pivot Table for easy viewing.
My initial idea was to convert the IEnumerable to a 2D array (object[,]) and then push it into an Excel range using COM interop:
public static object[,] To2DArray<T>(this IEnumerable<T> objectList)
{
    Type t = typeof(T);
    PropertyInfo[] fields = t.GetProperties();

    // Note: Count() forces a full enumeration up front, so the source is walked twice.
    object[,] my2DObject = new object[objectList.Count(), fields.Length];

    int row = 0;
    foreach (var o in objectList)
    {
        int col = 0;
        foreach (var f in fields)
        {
            my2DObject[row, col] = f.GetValue(o, null) ?? string.Empty;
            col++;
        }
        row++;
    }

    return my2DObject;
}
I then took that object[,] and did a "transaction split", as I called it, which just splits the object[,] into smaller chunks: I create a List<object[,]>, then go through each chunk and send it into an Excel range using something similar to:
Excel.Range range = worksheet.get_Range(cell, cell);
range.Value2 = chunks[0];   // chunks being the List<object[,]>
I'd obviously loop the above; it's just written this way for simplicity.
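Spelled out, the loop is roughly this (the chunk list and the cell addressing are simplified placeholders, not my exact code):
int startRow = 1;
foreach (object[,] chunk in chunks)               // chunks is the List<object[,]> described above
{
    int rows = chunk.GetLength(0);
    int cols = chunk.GetLength(1);
    Excel.Range range = worksheet.get_Range(
        worksheet.Cells[startRow, 1],
        worksheet.Cells[startRow + rows - 1, cols]);
    range.Value2 = chunk;                         // one COM call per chunk instead of per cell
    startRow += rows;
}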
This works, but it takes an enormous amount of time to process (over 30 minutes).
I've also dabbled in outputting the IEnumerable to CSV, but that isn't very efficient either, since the .csv file has to be created first and then opened via COM interop to do the Excel pivot table formatting.
My question: Is there a better (preferred) way to do this?
Should I force execution (.ToList()) before iteration?
Should I use a different mechanism to output/display the data?
I'm open to any options to get a disconnected IEnumerable out to file in an efficient manner.
- I wouldn't be opposed to using something like SQL Express.

The main question will be where the bottleneck is. I'd have a look at the code in a profiler to see which part of the execution is taking the most time. It can also be worthwhile to look at your resource usage by running the process and seeing whether there is a shortage of CPU or memory, or whether it's disk-bound.
If you're getting sensible performance doing 2000 records at a time, then I suspect memory may be the constraint: with the code you posted you're converting an IEnumerable (which can avoid loading a complete dataset into memory) into an entirely in-memory structure with potentially a million records. Depending on the size and number of fields involved, this could easily become an issue.
If the problem looks like the time to create the Excel file itself (which it doesn't immediately sound like it is in this case), then COM interop calls can add up. Some of the 3rd-party Excel libraries aim to be much faster at writing Excel files, particularly with large numbers of records, so rather than necessarily using the Excel binary format and COM, I'd suggest looking at an open-source library like EPPlus (http://epplus.codeplex.com/) and seeing what the performance difference is like.
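For a sense of the API, here's a minimal EPPlus sketch (untested here; the file path and the 'records' collection are placeholders, and the exact LoadFromCollection overloads depend on the EPPlus version you pull in):
using System.IO;
using OfficeOpenXml;

// write an IEnumerable<T> straight to an .xlsx worksheet in one call
using (var package = new ExcelPackage(new FileInfo(@"C:\Reports\report.xlsx")))
{
    var ws = package.Workbook.Worksheets.Add("Data");
    ws.Cells["A1"].LoadFromCollection(records, true);   // true = write property names as a header row
    package.Save();
}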

Related

Populating FarPoint Spread with huge chunk of Data (64-bit Spreadsheet Issue)

In C# (64-bit), I am trying to populate a FarPoint spreadsheet with approximately 70,000 rows. Loading the entire data set onto the spreadsheet takes 3-4 hours, which gives the whole process serious performance issues.
Currently I am populating the spreadsheet cell by cell. Is there anything I can do to improve the performance?
Below is my code template for populating the spreadsheet by individual cells.
public void PopulateSpreadsheet(FarPoint.Win.Spread.FpSpread SS)
{
    SS.SuspendLayout();

    int rows = 70000;
    for (int i = 0; i < rows; i++)
    {
        // one cell write at a time - this is the slow part
        SS.ActiveSheet.Cells[i, 0].Text = "Data to populate";
    }

    SS.ResumeLayout();
}
Please guide me on how to improve the performance. Any help is appreciated! Thank you in advance :)
Well, initially I was storing the data directly on the spreadsheet, which took a lot of time. To overcome this, I first stored my data in a DataTable and then set that DataTable as the data source of the spreadsheet.
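Roughly, the binding looks like this (FarPoint member names quoted from memory, and BuildDataTable is a hypothetical helper - check against your Spread version):
using System.Data;

// build the data off-screen first, then bind it in a single operation
DataTable table = BuildDataTable();          // hypothetical helper that fills the 70,000 rows
SS.SuspendLayout();
SS.ActiveSheet.DataSource = table;           // bind once instead of writing cell-by-cell
SS.ResumeLayout();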
Put the data into an object array and use the SetArray method. This is extremely fast and also allows different data types in the array.
I use this often, especially when the data come from Sql.
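Something along these lines (I'm recalling the SetArray signature from memory, and rowCount/columnCount stand in for your own dimensions - verify against your FarPoint version):
// fill a 2D object array in memory, then hand it to the sheet in one call
object[,] values = new object[rowCount, columnCount];
// ... populate 'values' from your source ...
SS.ActiveSheet.SetArray(0, 0, values);       // start writing at row 0, column 0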

C# Excel Reading optimization

My app will build an item list and grab the necessary data (e.g. prices, customer item codes) from an Excel file.
This reference Excel file has 650 rows and 7 columns.
The app will read 10-12 items in one run.
Would it be wiser to read line item by line item?
Or should I first read all the line items in the Excel file into a list/array and search from there?
Thank you
It's good to start by designing the classes that best represent the data regardless of where it comes from. Pretend that there is no Excel, SQL, etc.
If your data is always going to be relatively small (650 rows), then I would just read the whole thing into whatever data structure you create (your own classes). Then you can query those for whatever data you want, like:
var itemsIWant = allMyData.Where(item => item.Value == "something");
The reason is that it enables you to separate the query (selecting individual items) from the storage (whatever file or source the data comes from.) If you replace Excel with something else you won't have to rewrite other code. If you read it line by line then the code that selects items based on criteria is mingled with your Excel-reading code.
Keeping things separate enables you to more easily test parts of your code in isolation. You can confirm that one component correctly reads what's in Excel and converts it to your data. You can confirm that another component correctly executes a query to return the data you want (and it doesn't care where that data came from.)
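For illustration, the split might look something like this (class and interface names are made up for the example, not taken from the question):
using System.Collections.Generic;

// the shape of the data, with no knowledge of Excel
public class PriceItem
{
    public string CustomerItemCode { get; set; }
    public decimal Price { get; set; }
}

// the reading component; an Excel-backed implementation lives behind this
public interface IPriceItemSource
{
    IReadOnlyList<PriceItem> GetAll();
}

// the querying code only ever sees the interface:
// var itemsIWant = source.GetAll().Where(item => item.CustomerItemCode == "something");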
With regard to optimization - you're going to be opening the file from disk and no matter what you'll have to read every row. That's where all the overhead is. Whether you read the whole thing at once and then query or check each row one at a time won't be a significant factor.

Reading large Excel file with Interop.Excel results in System.OutOfMemoryException

I followed this very promising link to make my program read Excel files, but the problem I get is a System.OutOfMemoryException. As far as I can gather, it happens because of this chunk of code:
object[,] valueArray = (object[,])excelRange.get_Value(
    XlRangeValueDataType.xlRangeValueDefault);
which loads the whole range of data into one variable. I do not understand why the developers of the library decided to do it this way instead of providing an iterator that would parse a sheet line by line. So, I need a working solution that makes it possible to read large (>700K rows) Excel files.
I am using the following function in one of my C# applications:
string[,] ReadCells(Excel._Worksheet WS,
                    int row1, int col1, int row2, int col2)
{
    Excel.Range R = WS.get_Range(GetAddress(row1, col1),
                                 GetAddress(row2, col2));
    ....
}
The reason to read a Range in one go rather than cell-by-cell is performance.
For every cell access, a lot of internal data transfer is going on. If the Range is too large to fit into memory, you can process it in smaller chunks.
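A rough sketch of that chunked approach, reusing the GetAddress helper from the snippet above (the block size and variable names are illustrative):
const int chunkRows = 10000;                          // rows per block - adjust to taste
int totalRows = WS.UsedRange.Rows.Count;
int totalCols = WS.UsedRange.Columns.Count;

for (int startRow = 1; startRow <= totalRows; startRow += chunkRows)
{
    int endRow = Math.Min(startRow + chunkRows - 1, totalRows);
    Excel.Range block = WS.get_Range(GetAddress(startRow, 1),
                                     GetAddress(endRow, totalCols));
    // note: the returned array is 1-based
    object[,] values = (object[,])block.get_Value(XlRangeValueDataType.xlRangeValueDefault);
    // ... process 'values' here, then let it go out of scope before the next block ...
}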

Intersecting 2 big datasets

I have a giant (100Gb) csv file with several columns and a smaller (4Gb) csv, also with several columns. The first column in both datasets holds the same category. I want to create a third csv with the records of the big file whose first column matches a first-column value in the small csv. In database terms it would be a simple join on the first column.
I am trying to find the best approach in terms of efficiency. As the smaller dataset fits in memory, I was thinking of loading it into some sort of set structure, then reading the big file line by line, querying the in-memory set, and writing matching rows to the output file.
Just to frame the question in SO terms, is there an optimal way to achieve this?
EDIT: This is a one time operation.
Note: the language is not relevant, open to suggestions on column, row oriented databases, python, etc...
Something like
import csv

def main():
    # collect the categories (first column) of the small file into a set
    with open('smallfile.csv', 'rb') as inf:
        in_csv = csv.reader(inf)
        categories = set(row[0] for row in in_csv)

    # stream the big file and keep only the rows whose first column matches
    with open('bigfile.csv', 'rb') as inf, open('newfile.csv', 'wb') as outf:
        in_csv = csv.reader(inf)
        out_csv = csv.writer(outf)
        out_csv.writerows(row for row in in_csv if row[0] in categories)

if __name__ == "__main__":
    main()
I presume you meant 100 gigabytes, not 100 gigabits; most modern hard drives top out around 100 MB/s, so expect it to take around 16 minutes just to read the data off the disk.
If you are only doing this once, your approach should be sufficient. The only improvement I would make is to read the big file in chunks instead of line by line. That way you don't have to hit the file system as much. You'd want to make the chunks as big as possible while still fitting in memory.
If you will need to do this more than once, consider pushing the data into some database. You could insert all the data from the big file and then "update" that data using the second, smaller file to get a complete database with one large table holding all the data. If you use a NoSQL database like Cassandra, this should be fairly efficient, since Cassandra is pretty good at handling writes.

How can I export very large amount of data to excel

I'm currently using EPPlus to export data to Excel. It works admirably for small amounts of data, but it consumes a lot of memory when exporting large amounts.
I've briefly taken a look at OOXML and/or the Microsoft Open XML SDK 2.5. I'm not sure whether I can use it to export data to Excel?
There are also third-party libraries.
I wonder what solution could properly do the job of exporting very large amounts of data with good performance and without taking too much space (ideally less than 3x the amount of data to export)?
Update: some extra requirements...
I need to be able to export "color" information (which excludes CSV) and I would like something easy to use like the EPPlus library (which excludes the raw XML format itself). I found another thread where Aspose and SpreadsheetGear were recommended, and I'm trying those. I marked the first answer as accepted. Thanks to all.
Update 2016-02-16: just as information... we now use SpreadsheetGear and we love it. We needed support once and it was awesome.
Thanks
A few years ago, I wrote a C# library to export data to Excel using the OpenXML library, and I faced the same situation.
It worked fine until you started to have about 30k+ rows, at which point the libraries would try to cache all of your data... and would run out of memory.
However, I fixed the problem by using the OpenXmlWriter class. This writes the data directly into the Excel file (without caching it first) and is much more memory efficient.
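To give a feel for the underlying pattern, here is a stripped-down sketch of the SDK's streaming approach (this is not the library's actual code, just the general shape):
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

using (var doc = SpreadsheetDocument.Create("C:\\temp\\large.xlsx", SpreadsheetDocumentType.Workbook))
{
    var workbookPart = doc.AddWorkbookPart();
    var worksheetPart = workbookPart.AddNewPart<WorksheetPart>();

    // stream the rows straight to the worksheet part instead of building them all in memory
    using (var writer = OpenXmlWriter.Create(worksheetPart))
    {
        writer.WriteStartElement(new Worksheet());
        writer.WriteStartElement(new SheetData());
        for (int r = 1; r <= 700000; r++)
        {
            writer.WriteStartElement(new Row());
            writer.WriteElement(new Cell { DataType = CellValues.Number, CellValue = new CellValue(r.ToString()) });
            writer.WriteEndElement();   // Row
        }
        writer.WriteEndElement();       // SheetData
        writer.WriteEndElement();       // Worksheet
    }

    // wire the worksheet part into the workbook
    using (var writer = OpenXmlWriter.Create(workbookPart))
    {
        writer.WriteStartElement(new Workbook());
        writer.WriteStartElement(new Sheets());
        writer.WriteElement(new Sheet { Name = "Data", SheetId = 1, Id = workbookPart.GetIdOfPart(worksheetPart) });
        writer.WriteEndElement();       // Sheets
        writer.WriteEndElement();       // Workbook
    }
}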
And, as you'll see, the library is incredibly easy to use: just call one CreateExcelDocument function and pass it a DataSet, DataTable or List<>:
// Step 1: Create a DataSet, and put some sample data in it
DataSet ds = CreateSampleData();

// Step 2: Create the Excel .xlsx file
try
{
    string excelFilename = "C:\\Sample.xlsx";
    CreateExcelFile.CreateExcelDocument(ds, excelFilename);
}
catch (Exception ex)
{
    MessageBox.Show("Couldn't create Excel file.\r\nException: " + ex.Message);
    return;
}
You can download the full source code for C# and VB.Net from here:
Mike's Export to Excel
Good luck !
If your requirements are simple enough, you can just use CSV.
If you need more detail, look into SpreadsheetML. It's an XML schema that you can use to create a text document that Excel can open natively. It supports formulas, multiple worksheets per workbook, formatting, etc.
I second using CSV but note that Excel has limits to the number of rows and columns in a worksheet as described here:
http://office.microsoft.com/en-us/excel-help/excel-specifications-and-limits-HP010342495.aspx
specifically:
Worksheet size 1,048,576 rows by 16,384 columns
This is for Excel 2010. Keep these limits in mind when working with very large amounts of data.
As an alternative you can use my SwiftExcel library. It was designed for high-volume Excel output and writes data directly to the file with no memory impact.
Here is a sample of usage:
using (var ew = new ExcelWriter("C:\\temp\\test.xlsx"))
{
    for (var row = 1; row <= 100; row++)
    {
        for (var col = 1; col <= 10; col++)
        {
            ew.Write($"row:{row}-col:{col}", col, row);
        }
    }
}
