Excel Interop - Efficiency and performance


I was wondering what I could do to improve the performance of Excel automation, as it can be quite slow if you have a lot going on in the worksheet...
Here are a few I found myself:
ExcelApp.ScreenUpdating = false -- turns off redrawing of the screen
ExcelApp.Calculation = Excel.XlCalculation.xlCalculationManual -- turns off the calculation engine so Excel doesn't automatically recalculate when a cell value changes (turn it back on after you're done)
Reduce calls to Worksheet.Cells.Item(row, col) and Worksheet.Range -- I had to poll hundreds of cells to find the cell I needed. Caching the cell locations reduced the execution time from ~40 to ~5 seconds.
What kind of interop calls take a heavy toll on performance and should be avoided? What else can you do to avoid unnecessary processing being done?

When using C# or VB.Net to either get or set a range, figure out the total size of the range, and then get or set it as one large two-dimensional object array...
//get values (the array Excel returns is 1-based, and Value2 must be cast)
object[,] objectArray = (object[,])shtName.get_Range("A1:Z100").Value2;
iFace = Convert.ToInt32(objectArray[1,1]);
//set values (3 rows x 1 column)
object[,] newValues = new object[3,1] {{"A"},{"B"},{"C"}};
rngName.Value2 = newValues;
Note that it's important to know what data type Excel is storing (text or numbers), because nothing converts the values for you when you read them back out of the object array. If you can't be sure of the data type beforehand, add tests to validate the data.
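The validation mentioned above can be sketched as a small helper. This is only an illustration: the names CellValue and TryGetInt are hypothetical, and it assumes numeric cells arrive as double, text cells as string, and empty cells as null in the array read from Value2.

```csharp
using System;
using System.Globalization;

static class CellValue
{
    // Tries to read one element of an object[,] taken from Value2 as an int.
    // Returns false for empty cells, non-integral numbers and non-numeric text
    // instead of throwing the way a blind Convert.ToInt32 call could.
    public static bool TryGetInt(object cell, out int result)
    {
        result = 0;
        switch (cell)
        {
            // Assumption: numbers come back from Excel as double.
            case double d when d == Math.Floor(d) && d >= int.MinValue && d <= int.MaxValue:
                result = (int)d;
                return true;
            // Assumption: text cells come back as string.
            case string s:
                return int.TryParse(s.Trim(), NumberStyles.Integer,
                                    CultureInfo.InvariantCulture, out result);
            // null (empty cell) and anything else is treated as "not an int".
            default:
                return false;
        }
    }
}
```

With this in place, the read above could become `if (CellValue.TryGetInt(objectArray[1,1], out iFace)) { ... }` rather than an unconditional conversion.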

This is for anyone wondering what the best way is to populate an Excel sheet from a DB result set. This is not meant to be a full list by any means, but it does list a few options.
Here are some performance numbers from populating an Excel sheet with 155 columns and 4200 records on an old Pentium 4 3GHz box, in order from slowest to fastest. The times include data retrieval, which never took more than 10 seconds:
1. One cell at a time - just under 11 minutes
2. Populating a dataset by converting to HTML + saving the HTML to disk + loading the HTML into Excel and saving the worksheet as XLS/XLSX - 5 minutes
3. One column at a time - 4 minutes
4. Using the deprecated sp_makewebtask procedure in SQL 2005 to create an HTML file (9 seconds), followed by loading the HTML file in Excel and saving as XLS/XLSX - about 2 minutes
5. Converting the .Net dataset to an ADO RecordSet and using the WorkSheet.Range[].CopyFromRecordset function to populate Excel - 45 seconds!
I ended up using option 5. Hope this helps.
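One approach not measured in this answer is the bulk array write described elsewhere in this thread: copy the result set into a single object[,] and assign it to a range in one call. The sketch below only shows the conversion step. SheetData and ToObjectArray are hypothetical names, and no timing for this variant is claimed here.

```csharp
using System;
using System.Data;

static class SheetData
{
    // Copies a DataTable into a 2D object array: row 0 holds the column
    // names, rows 1..n hold the data. DBNull is mapped to null so that the
    // corresponding cells are left empty.
    public static object[,] ToObjectArray(DataTable table)
    {
        int rows = table.Rows.Count;
        int cols = table.Columns.Count;
        var values = new object[rows + 1, cols];

        for (int c = 0; c < cols; c++)
            values[0, c] = table.Columns[c].ColumnName;

        for (int r = 0; r < rows; r++)
        {
            for (int c = 0; c < cols; c++)
            {
                object v = table.Rows[r][c];
                values[r + 1, c] = (v == DBNull.Value) ? null : v;
            }
        }
        return values;
    }
}
```

The resulting array would then be assigned to the Value2 property of a range with the same number of rows and columns, so the whole table crosses the COM boundary once instead of once per cell.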

If you're polling values of many cells you can get all the cell values in a range stored in a variant array in one fell swoop:
Dim CellVals() as Variant
CellVals = Range("A1:B1000").Value
There is a tradeoff here, in terms of the size of the range you're getting values for. I'd guess if you need a thousand or more cell values this is probably faster than just looping through different cells and polling the values.

Use Excel's built-in functionality whenever possible. For example, instead of looping over a whole column to search for a given string, use the Find method (the same command that Ctrl-F opens in the GUI):
Set Found = Cells.Find(What:=SearchString, LookIn:=xlValues, _
SearchOrder:=xlByRows, SearchDirection:=xlNext, _
MatchCase:=False, SearchFormat:=False)
If Not Found Is Nothing Then
Found.Activate
(...)
End If
If you want to sort some lists, use Excel's Sort command; don't do it manually in VBA:
Selection.Sort Key1:=Range("A1"), Order1:=xlAscending, Header:=xlGuess, _
OrderCustom:=1, MatchCase:=False, Orientation:=xlTopToBottom, _
DataOption1:=xlSortNormal

As Anonymous Type says: reading/writing large range blocks is very important to performance.
In cases where the COM-Interop overhead is still too large you may want to switch to using the XLL interface, which is the fastest Excel interface.
Although the XLL interface is primarily meant for C++ users, both XL DNA and Addin Express provide .NET to XLL bridge capability which is significantly faster than COM-Interop.

Performance also depends a lot on how you automate Excel. VBA is faster than COM automation, which in turn is faster than .NET automation. And typically early (compile time) binding is faster than late binding, too.
If you have serious performance problems you could think of moving the critical parts of the code to a VBA module and call that code from your COM/.NET automation code.
If you use .NET you should also use the optimized primary interop assemblies available from Microsoft and not use custom-built interop assemblies.

Another big thing you can do in VBA is to use Option Explicit and avoid Variants wherever possible. Variants are not 100% avoidable in VBA, but they make the interpreter do more work at runtime and waste memory.
I found this article very helpful when I was starting with VBA in Excel.
http://www.ozgrid.com/VBA/SpeedingUpVBACode.htm
And this book
http://www.amazon.com/VB-VBA-Nutshell-Language-OReilly/dp/1565923588
Similar to
app.ScreenUpdating = false //and
app.Calculation = xlCalculationManual
you can also set
app.EnableEvents = false //Prevent Excel events
app.Interactive = false //Prevent user clicks and keystrokes
although they don't seem to make as big a difference as the first two.
Similar to setting Range values to arrays, if you are working with data that is mostly tables with the same formula in every row of a column, you can use R1C1 formula notation for your formula and set an entire column equal to the formula string to set the whole thing in one call.
app.ReferenceStyle = xlR1C1
app.ActiveSheet.Columns(2) = "=SUBSTITUTE(C[-1],\"foo\",\"bar\")" //inner quotes must be escaped
Creating XLL add-ins using ExcelDNA & .NET (or the hard way in C) is also the only way you can get UDFs to run on multiple threads. (See Excel DNA's ExcelFunction attribute's IsThreadSafe property.)
Before I transitioned to Excel DNA completely, I also experimented with creating COM visible libraries in .NET to reference in VBA projects. Heavy text processing is a bit faster than VBA that way, as is using wrapped .NET List classes instead of VBA's Collection, but Excel DNA is better.

Related

OutputBuffer not working for large c# list

I'm currently using SSIS to improve a project. I need to insert single documents into a MongoDB collection of type Time Series. At some point I want to retrieve rows of data after going through a C# transformation script. I did this:
foreach (BsonDocument bson in listBson)
{
OutputBuffer.AddRow();
OutputBuffer.DatalineX = (string) bson.GetValue("data");
}
But this piece of code, which works great with a small file, does not work with a 6 million line file. That is, there are no lines in the output. The following tasks validate but behave as if they had received nothing as input.
Where could the problem come from?

Your OutputBuffer has DatalineX defined as a string, either DT_STR or DT_WSTR, with a specific length. When you exceed that length, things go bad. For normal strings, the maximum length is 8k or 4k characters respectively.
Neither of those is useful for your use case of at least 6M characters. To handle that, you'll need to change your data type to DT_TEXT/DT_NTEXT. Those data types do not require a length as they are "max" types. There are lots of things to be aware of when using the LOB types:
Performance can suck depending on whether SSIS can keep the data in memory (good) or has to write intermediate values to disk (bad)
You can't readily manipulate them in a data flow
You'll use a different syntax in a Script Component to work with them
e.g.
// TODO: convert to bytes
Output0Buffer.DatalineX.AddBlobData(bytes);
Longer example of questionable accuracy with regard to encoding the bytes that you get to solve at https://stackoverflow.com/a/74902194/181965

Excel not happily displaying large 2D Range FormulaArray

I have an XLL which returns an LPXLOPER result of type 2D array for a Range with FormulaArray.
Things go happily <1s update until I hit about size 50x200. At that point, Excel gets stuck blinking "Ready (pretty Excel graphic)" and "Filling cells (empty progress bar)" at 100% usage of 1 core which goes on for less than half a minute before returning values.
At 100x100 it takes 8-10 minutes.
At 200x100 I'm still waiting for it to return.
The code is identical in all cases. I step through the VB and it hangs on calling RUN(...) to populate the data array. No further code is executed. I put breakpoints in my XLL and it doesn't hit any of them. I break into Excel and it's doing Excel stuff in EXCEL.EXE or in libraries I didn't even know existed.
Anyone know (a) what Excel is doing when it says Ready / Filling Cells even though it is obviously NOT ready, and (b) why the nonlinear growth wrt data size?

I tested an XLL returning a 200x100 array and it's virtually instantaneous. So the problem must be either dependency building, calculation, or erasing/building the cell table.
Try
- switching calculation to manual to turn off calculation
- setting ForceFullCalculation to true to switch off dependency building
- test with an empty workbook to see if it is caused by the workbook contents

Easy way to increment cell numbers in Excel

I'm an intermediate C# programmer, but I'm just starting out with Office automation, specifically Excel for now. I've got to say, the Office API is lacking, or at least it forces you to think about problems differently. One thing that's driving me nuts is cell addresses, such as A1 and B5 and so on. I'm forced to manipulate them often, but there's no easy way to do this. For example, if I'm on cell C7 and want to copy or move something to B7, I can't just use --C7. Instead I have to figure out the numerical value of C, decrement it, turn it back into a letter, then concatenate it with the row number again.
I could write methods to do this myself (e.g. decrementColumn(), decrementRow(), addColumns( String currentCellName, int howManyToAdd) ), but I don't want to reinvent the wheel. Does a library of functions exist for such oft-needed conversions or am I going to have to roll my own?

To copy/move values easily, you can use the .Offset method, which returns a Range.
For example, if the range/cell you are working with is C7, where rng represents this Range object:
rng.Offset(0,-1).Value = rng.Value
Offset(0,-1) returns the range shifted by -1 columns, i.e. one column to the left.
rng.Offset(10,15) would return a cell/range 10 rows below and 15 columns to the right, etc.
You may also look at the R1C1 address style in Excel, although I have never been fond of that. This link is for Excel 2007 but should be mostly appropriate for any version of Excel.
http://msdn.microsoft.com/en-us/library/office/ee264226(v=office.12).aspx
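If you still need to work with A1-style addresses as strings, as the question describes, the letter/number conversion is small enough to write yourself. The following is a minimal sketch, not part of any Office library. ColumnName, ToNumber and FromNumber are hypothetical names chosen for illustration.

```csharp
using System;

static class ColumnName
{
    // "A" -> 1, "Z" -> 26, "AA" -> 27 (bijective base-26).
    public static int ToNumber(string letters)
    {
        if (string.IsNullOrEmpty(letters))
            throw new ArgumentException("Column name is empty.", nameof(letters));

        int number = 0;
        foreach (char ch in letters.ToUpperInvariant())
        {
            if (ch < 'A' || ch > 'Z')
                throw new ArgumentException("Column name must contain only letters.", nameof(letters));
            number = number * 26 + (ch - 'A' + 1);
        }
        return number;
    }

    // 1 -> "A", 26 -> "Z", 27 -> "AA".
    public static string FromNumber(int number)
    {
        if (number < 1)
            throw new ArgumentOutOfRangeException(nameof(number));

        string letters = "";
        while (number > 0)
        {
            int remainder = (number - 1) % 26;          // 0..25 maps to A..Z
            letters = (char)('A' + remainder) + letters;
            number = (number - 1) / 26;
        }
        return letters;
    }
}
```

Moving one column left from C7 then becomes `ColumnName.FromNumber(ColumnName.ToNumber("C") - 1) + "7"`, which yields "B7". Where a Range object is already at hand, Offset remains the simpler route.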

Excel Recalculation

I am using the Excel PIAs to write to and read from Excel spreadsheets. I may just be being paranoid, but I have the following questions:
As far as I can tell, Excel recalculates the formulas in the worksheet upon every write, but...
Is this the case? That is, is it possible to do a series of write, read, write, read and not read the correct recalculated values (e.g. if it's a complex formula that takes too long, could I end up reading a value that has not been recalculated yet)?
Is there any way to do something like:
BeginUpdate(); write lots of values EndUpdate(); Recalculate(); readlotsofvalues ?
I have not seen any dodgy results, but I would like to be able to know "for sure" ;)

Some VBA functions that will work are here. To use these, you can use the SpreadsheetClass in Interop.
For C#, you have the Calculate() function.

Does anyone have .Net Excel IO component benchmarks?

I need to access Excel workbooks from .Net. I know all about the different ways of doing it (I've written them up in a blog post), and I know that using a native .Net component is going to be the fastest. But the question is, which of the components wins? Has anybody benchmarked them? I've been using Syncfusion XlsIO, but that's very slow for some key operations (like deleting rows in a workbook containing thousands of named ranges).

I haven't done any proper benchmarks, but I tried out several other components, and found that SpreadsheetGear was considerably faster than XlsIO, which I was using before. I've written up some of my findings in this post

Can't help you with your original question, but are you aware that you can access Excel files using an OleDbConnection, and therefore treat it as a database? You can then read worksheets into a DataTable, perform all the changes you need to the data in your application, and then save it all back to the file using an OleDbConnection.

Yes, but I'm not going to publish them: partly out of courtesy to Syncfusion (they ask you not to publish benchmarks), partly because I'm not an experienced tester so my tests are probably somewhat flawed, but mostly because what you actually benchmark makes a huge difference to who wins and by how much.
I took one of their "performance" examples and added the same routine in EPPlus to compare them. XLSIO was around 15% faster with just straightforward inserts, depending on the row/column ratio (I tried a few), and memory usage seemed very similar. When I added a routine that, after all the rows were added, deleted every 10th row and then inserted a new row 2 rows up from that, XLSIO was significantly slower.
A generic benchmark is pretty much useless to you. You need to try them against each other in the specific scenarios you use.
I have been using EPPlus for a few years and the performance has been fine; I don't recall shouting at it.
More worthy of your consideration are the functionality, support (Syncfusion have been good, in my experience), documentation, access to the source code if that is important, and - importantly - how much sense the API makes to you, since the syntax can be quite different. For example, named styles:
XLSIO
headerStyle.BeginUpdate();
workbook.SetPaletteColor(8, System.Drawing.Color.FromArgb(255, 174, 33));
headerStyle.Color = System.Drawing.Color.FromArgb(255, 174, 33);
headerStyle.Font.Bold = true;
headerStyle.Borders[ExcelBordersIndex.EdgeLeft] .LineStyle = ExcelLineStyle.Thin;
headerStyle.Borders[ExcelBordersIndex.EdgeRight] .LineStyle = ExcelLineStyle.Thin;
headerStyle.Borders[ExcelBordersIndex.EdgeTop] .LineStyle = ExcelLineStyle.Thin;
headerStyle.Borders[ExcelBordersIndex.EdgeBottom].LineStyle = ExcelLineStyle.Thin;
headerStyle.EndUpdate();
EPPlus
ExcelNamedStyleXml headerStyle = xlPackage.Workbook.Styles.CreateNamedStyle("HeaderStyle");
headerStyle.Style.Fill.PatternType = ExcelFillStyle.Solid; // <== needed or BackgroundColor throws an exception
headerStyle.Style.Fill.BackgroundColor.SetColor(System.Drawing.Color.FromArgb(255, 174, 33));
headerStyle.Style.Font.Bold = true;
headerStyle.Style.Border.Left.Style = ExcelBorderStyle.Thin;
headerStyle.Style.Border.Right.Style = ExcelBorderStyle.Thin;
headerStyle.Style.Border.Top.Style = ExcelBorderStyle.Thin;
headerStyle.Style.Border.Bottom.Style = ExcelBorderStyle.Thin;
