Best way reading from dirty excel sheets - c#

I have to manipulate some Excel documents with C#. It's a batch process with no user interaction. It's going to parse data into a database, then output nice reports. The data is very dirty and cannot be ready using ADO. The data is nowhere near a nice table format.
Best is defined as the most stable(updates less likely to break)/ clear(succinct) code. Fast doesn't matter. If it runs in less than 8 hours I'm fine.
I have the logic to find the data worked out. All I need to make it run is basic cell navigation and getvalue type functions. Give me X cell value as string, if it matches Y value with levenshtein distance < 3, then give me Z cell value.
My question is, what is the best way to dig into the excel?
VSTO?
Excel Objects Library?
Third Option I'm not aware of?

VSTO is kind of a pain because of permissions and the fact that your dll becomes hooked to the document you're using. Assuming you're not actually changing the files, and ADO is definitely not an option, I would say that automation through the Excel COM interfaces is your best bet. It lets you program the way you normally would for any other application, and gives you just as many options for data extraction as VSTO.

The Office programs can be loaded as objects in .NET. The following is the coding stub that I used to load Excel into VB6. The code is essentially going to be the same regardless of which MS language you use.
Dim xlApp As New Excel.Application
Dim wb As Excel.Workbook
Dim ws As Excel.Worksheet
On Error Resume Next
wb = xlApp.Workbooks.Open("c:\testdata.xls")
If Err.Number > 0 Then
If Err.Number = 1004 Then
MsgBox("File not found")
Else
MsgBox("Error " & Err.Number & " occurred.")
End If
Exit Sub
End If
ws = wb.Sheets("Sheet1")
Text1.Text = ws.Cells(1, 1).Value
wb = Nothing
ws = Nothing
xlApp = Nothing

Well try to see stack over flow question Convert Excel Range to ADO.NET DataSet or DataTable, etc

Related

Embed file in excel worksheet/cell

I hope someone can help me. Is there a way to embed a specific file (.txt) into an excel cell? I'm currently using epplus, and I would like to embed programmatically a file into a specific excel cell. I did manage to add a hyperlink, but my goal is to have it embedded.
Worksheet.Cells[rowNumber, colNumber].Value = ....
Is there any way to do it? I couldn't find anything online.
As mentioned in the comments, you can certainly put text within a cell, but bear in mind Excel does have a limit to the number of characters it will allow in a single cell. It's pretty large, but conceivably the contents of a text file could exceed that limit -- even if future versions of Excel keep increasing what the limit is (as they have in the past).
You can also embed an OLE object in your worksheet, and a text file qualifies for that. I don't know that you can assign it to a cell, per se. You can change the location, shape and behavior to fit in a cell and behave as though it's part of a cell, but I don't know that it ever belongs to a range the way formulas do. I could be wrong.
The basic construct of how to embed an OLE object into a worksheet is as follows:
Excel.OLEObject ole = ws.OLEObjects().Add(Filename: #"C:\Users\hambone\Documents\foo.txt");
This is the equivalent of the VBA:
Set ole = sh.OLEObjects.Add(Filename:="C:\Users\hambone\Documents\foo.txt")
The method returns an OLEObject object, which you can then shape to behave the way you want:
ole.Height = 5;

C# Interop Excel Range get_End is returning with 1 less element

I need to parse an Excel file. First I wrote an extension in Visual Basic inside the Excel file, all worked good. Now I need to port it to C# so it can be a separate application. While the functions I use are the same, the result is not the same...
When I choose from the GUI which Worksheet to parse, I do something like:
range = (workbook.Worksheets.get_Item(itemIndex) as Excel.Worksheet).UsedRange;
Then, for the first row I need to parse I do something like:
range.get_Range(range.Cells.get_Item(6, 2),
range.Cells.get_Offset(6,2).get_End(Excel.XlDirection.xlToRight)))
And I get the right result with all the fields I need.
The second time when I need to get another row, I do:
range.get_Range(range.Cells.get_Item(13, 3),
range.Cells.get_Offset(13, 3).get_End(Excel.XlDirection.xlToRight)))
This time it gives me all the elements except the last one. And I have more functions like this, some with XlDirection.xlDown and all of them return me the range without the last element.
I tried to swap the functions, thinking may be I need to release range and then acquire it again or something(wanted to check if it's always working only for the first function being executed) but it is always working only for the first example, whenever the function is being executed...
This is even stranger because it worked in VBA Excel.
I also tired with Excel.Application get_Range and Excel.Worksheet get_Range...
Anyone knows why this happens?
I managed to solve this strange behavior. It's not the correct way of getting out the data.
The correct way would be: range.get_Range(range.Cells[6, 2], (range.Cells[6, 2] as Excel.Range).get_End(Excel.XlDirection.xlToRight)) - for the first example.
Hope it helps somebody...

Fastest way to enumerate or look for all empty Excel cells and change their style or do some other processing

Which would be potentially a best way to enumerate or iterate or simply look for empty cells or cells with specific data structure in Excel, and later once you find it do some processing on it.
I tired Range, Value, Value2, etc but it takes fairly long time when Excel Sheet is considerably larger. I believe there must be some other efficient way.
It would be nice, if you can show some example snippet.
The answer is relativley simple: get the array in one batch from excel (search SO for a how to) - test the values of the erray for empty cells and then acess only the empty cells in excel.
It is somewhat cumbersome, but the fastes way because iterating each cell is vastly slower than simply getting all data in a batch.
To find blank cells, use the .SpecialCells method of a range object.
http://msdn.microsoft.com/en-us/library/microsoft.office.interop.excel.range.specialcells(v=office.11).aspx
The .specialCells method returns a range object of the matching criteria (i.e., xlCellTypeVisible, xlCellTypeBlanks, etc.). You can then iterate of this range to perform your formatting, etc.
Update I'm not a C# programmer, but I can show you how I would do this in VBA. Assuming interop exposes most/all of the same methods and functionality, you should hopefully be able to translate this for your purposes.
Sub ColorVisibles()
Dim rng As Range
Dim rngBlanks As Range
Dim blanksExist As Boolean
'define your range
Set rng = Range("A1:AA300")
'check to make sure there are blank cells in the range:
blanksExist = Application.WorksheetFunction.CountBlank(rng) > 0
If blanksExist Then
Set rngBlanks = rng.SpecialCells(xlCellTypeBlanks)
rngBlanks.Interior.Color = vbYellow
Else:
MsgBox "No blank cells exist in the specified range.", vbInformation
End If
End Sub

Get the formatting of a cell in an Excel 2010 template

Im generating an Excel 2010 report based on a template in code using Microsoft.Office.Interop.Excel;
I need to grab the formatting from a cell in the template and apply it on subsequent cells down the column. To simplify, I want my cells to be right justified. Maybe its faster to specify this explicitly in code but I would prefer if I could base formatting on the template and not hardcode a particular style.
I am using workSheet.get_Range("L2", Type.Missing).get_Resize(1, 1) to select the cell.
Any suggestion greatly appreciated.
I'm not sure where get_Range() comes into the picture. I usually just get a Range directly.
I think what you are after is something to this effect (filler code there so you can see the variable names):
Dim oExcel As Object
Dim oBook As Excel.Workbook
Dim oSheet As Excel.Worksheet
Dim AlignType As Long
oExcel = CreateObject("Excel.Application")
oBook = oExcel.Workbooks.Open("MySheet.xlsx")
oSheet = oBook.Worksheets(1)
AlignType = oSheet.Range("G1").HorizontalAlignment
oSheet.Range("G1:G" & oSheet.Range("G1").End(Excel.XlDirection.xlDown).Row).HorizontalAlignment = AlignType
Change the ranges to suit your own code.
Basically, read the value out (it's an enum, so you don't need to get the actual setting, just its number), and write it back into the other cells. You could probably do it all in one step, I separated it for clarity.

Excel Interop - Efficiency and performance

I was wondering what I could do to improve the performance of Excel automation, as it can be quite slow if you have a lot going on in the worksheet...
Here's a few I found myself:
ExcelApp.ScreenUpdating = false -- turn off the redrawing of the screen
ExcelApp.Calculation = Excel.XlCalculation.xlCalculationManual -- turning off the calculation engine so Excel doesn't automatically recalculate when a cell value changes (turn it back on after you're done)
Reduce calls to Worksheet.Cells.Item(row, col) and Worksheet.Range -- I had to poll hundreds of cells to find the cell I needed. Implementing some caching of cell locations, reduced the execution time from ~40 to ~5 seconds.
What kind of interop calls take a heavy toll on performance and should be avoided? What else can you do to avoid unnecessary processing being done?
When using C# or VB.Net to either get or set a range, figure out what the total size of the range is, and then get one large 2 dimensional object array...
//get values
object[,] objectArray = shtName.get_Range("A1:Z100").Value2;
iFace = Convert.ToInt32(objectArray[1,1]);
//set values
object[,] objectArray = new object[3,1] {{"A"}{"B"}{"C"}};
rngName.Value2 = objectArray;
Note that its important you know what datatype Excel is storing (text or numbers) as it won't automatically do this for you when you are converting the type back from the object array. Add tests if necessary to validate the data if you can't be sure beforehand of the type of data.
This is for anyone wondering what the best way is to populate an excel sheet from a db result set. This is not meant to be a full list by any means but it does list a few options.
Some performance numbers while attempting to populate an excel sheet with 155 columns and 4200 records on an old Pentium 4 3GHz box including data retrieval time which was never more than 10 seconds in order of slowest to fastest is as follows...
One cell at a time - Just under 11 minutes
Populating a dataset by converting to html + Saving html to disk + Loading html into excel and saving worksheet as xls/xlsx - 5 minutes
One column at a time - 4 minutes
Using the deprecated sp_makewebtask procedure in SQL 2005 to create an HTML file - 9 Seconds + Followed by loading the html file in excel and saving as XLS/XLSX - About 2 minutes.
Convert .Net dataset to ADO RecordSet and use the WorkSheet.Range[].CopyFromRecordset function to populate excel - 45 seconds!
I ended up using option 5. Hope this helps.
If you're polling values of many cells you can get all the cell values in a range stored in a variant array in one fell swoop:
Dim CellVals() as Variant
CellVals = Range("A1:B1000").Value
There is a tradeoff here, in terms of the size of the range you're getting values for. I'd guess if you need a thousand or more cell values this is probably faster than just looping through different cells and polling the values.
Use excels builtin functionality whenever possible, for example: Instead of searching a whole column for a given string, use the find command available in the GUI by Ctrl-F:
Set Found = Cells.Find(What:=SearchString, LookIn:=xlValues, _
SearchOrder:=xlByRows, SearchDirection:=xlNext, _
MatchCase:=False, SearchFormat:=False)
If Not Found Is Nothing Then
Found.Activate
(...)
EndIf
If you want to sort some lists, use the excel sort command, don't do it manually in VBA:
Selection.Sort Key1:=Range("A1"), Order1:=xlAscending, Header:=xlGuess, _
OrderCustom:=1, MatchCase:=False, Orientation:=xlTopToBottom, _
DataOption1:=xlSortNormal
As Anonymous Type says: reading/writing large range blocks is very important to performance.
In cases where the COM-Interop overhead is still too large you may want to switch to using the XLL interface, which is the fastest Excel interface.
Although the XLL interface is primarily meant for C++ users, both XL DNA and Addin Express provide .NET to XLL bridge capability which is significantly faster than COM-Interop.
Performance also depends a lot on how you automate Excel. VBA is faster than COM automation is faster than .NET automation. And typically early (compile time) binding is faster than late binding, too.
If you have serious performance problems you could think of moving the critical parts of the code to a VBA module and call that code from your COM/.NET automation code.
If you use .NET you should also use the optimized primary interop assemblies available from Microsoft and not use custom-built interop assemblies.
Another big thing you can do in VBA is to use Option Explicit and avoid Variants wherever possible. Variants are not 100% avoidable in VBA, but they make the interpreter do more work at runtime and waste memory.
I found this article very helpful when I was starting with VBA in Excel.
http://www.ozgrid.com/VBA/SpeedingUpVBACode.htm
And this book
http://www.amazon.com/VB-VBA-Nutshell-Language-OReilly/dp/1565923588
Similar to
app.ScreenUpdates = false //and
app.Calculation = xlCalculationManual
you can also set
app.EnableEvents = false //Prevent Excel events
app.Interactive = false //Prevent user clicks and keystrokes
although they don't seem to make as big a difference as the first two.
Similar to setting Range values to arrays, if you are working with data that is mostly tables with the same formula in every row of a column, you can use R1C1 formula notation for your formula and set an entire column equal to the formula string to set the whole thing in one call.
app.ReferenceStyle = xlR1C1
app.ActiveSheet.Columns(2) = "=SUBSTITUTE(C[-1],"foo","bar")"
Also, creating XLL add-ins using ExcelDNA & .NET (or the hard way in C) is also the only way you can get UDFs to run on multiple threads. (See Excel DNA's ExcelFunction attribute's IsThreadSafe property.)
Before I transitioned to Excel DNA completely, I also experimented with creating COM visible libraries in .NET to reference in VBA projects. Heavy text processing is a bit faster than VBA that way, as are using wrapped .NET List classes instead of VBA's Collection, but Excel DNA is better.

Categories

Resources