OLE DB vs Open XML SDK vs Excel Interop - C#

I need to read XLSX files and extract as much content from them as possible. Which of these APIs should I use: OLE DB, the Open XML SDK, or Excel Interop?
Which is the easiest to use?
Can any one of them retrieve all the information, e.g. dates, times, merged cells, tables, pivot tables, etc.?

You can try all of them and choose the one that fits you best.
Depending on the data you want to read, I'd suggest using Open XML over Interop or OLE DB.
I don't know the Open XML SDK, but I have some experience with the EPPlus library, which I use a lot and can only say good things about: it's fast, easy to learn, and comes with good examples. The library is based on the Office Open XML format, so I suppose it's pretty much the same as the SDK you've mentioned, and it can easily read and write Excel 2007 and 2010 files.
On the linked site you'll find the library itself, documentation, and some example "Hello World" projects to download.
Why that library first? Because with it you can read not only cell values but also their colors, fonts, widths and heights, merging, and all that detailed stuff, and you can modify it all as well. What's more, you don't need Excel installed to do that.
In second place, in case you only need to extract tabular data from a worksheet, you may play with OLE DB. I'm afraid that with it you won't be able to extract any information about formats, colors, etc., and the data must be laid out as a table in the worksheet so that you can treat it as a database table.
The last one is Interop, because:
- it's a COM library, so you need to be very careful when using it from .NET, as it's easy to cause ugly and hard-to-find memory leaks (confirmed by my own bad experience): if you don't release its objects properly, it leaves the Excel.exe process running,
- it's much slower than the previous methods,
- it adds almost nothing over the previous methods (EPPlus or OLE DB) and requires Excel to be installed on the client's machine, so why use it?
Good luck, then.

Related

Any easy way to embed files (column / row anchored) using EPPlus or POI-like or other OpenXml based libraries?

I have been using E-IceBlue's Spire.XLS library (License Purchase Page | nuget Package), and while it is excellent I've hit a couple of hurdles.
The gist of my requirement is this:
I have to take a bunch of data from our intranet CMS, along with attachments users have uploaded to it, and periodically email that information to a third party outside the company. We were originally sending the data and the user-uploaded attachments separately, but as the documents became more numerous and unwieldy, I got a request to try to combine everything into one file. The attachments were small enough to embed, so I achieved this by creating an Excel report using Spire.XLS, which allows me not only to add OleObjects to the package but also to position them (anchor them) to a specific row or column, maintaining a nice visual link with the data from the CMS record. As such, I can have all my data on a row in columns A through AB, for example, and the attachments start appearing right at the end of the row in columns AC, AD, etc.
In terms of how I implemented that - I grab my data from the CMS, iterate through each item (which includes attachment / File data), I get the default image / icon for the relevant file-type, create an OleObject on the Worksheet and then I position it -- something a bit like this:
MyAttachmentCollection attachments = GetAttachments(itemId);
foreach (File attachment in attachments) {
    string fileType;
    string localFilePath;
    // Use WebClient to download file locally..
    /* --- pseudo-code omitted for brevity -- */
    worksheet.OleObjects.Add(localFilePath, image, OleLinkType.Embed);
    worksheet.OleObjects.Last().Location = worksheet.Range[row, col + 1];
    worksheet.OleObjects.Last().ObjectType = fileType;
    col++;
}
Nice and simple, and the result is pretty good. Sadly, its success means that the powers that be want to send more and more data this way, without ponying up the cash for a Spire.XLS license. The free license only allows 200 rows of data or 5 worksheet tabs. This is a single use case for us, so I think they're finding it hard to justify the license cost for this one development and its future upkeep. We're public services too, so budget-wise we have to try and do things on the cheap!
I'm aware that XLSX / Open XML spreadsheet documents are basically zipped/packaged storage containers, so I've taken a look at the contents of an Excel file that contains some attachments added in this way, and I've tried to go about understanding the various schemas and how I might replicate the effect, but I'm struggling to wrap my head around it to be honest and I'm wondering if any other libraries might exist that do this sort of leg work already?
One of the things I love about EPPlus (Old Codeplex Page | nuget Package) is being able to take a DataSet or DataTable and insert it directly into a worksheet at a given cell reference. I also like that I can use built-in Excel styles or define my own and apply those. I can create really lovely looking spreadsheets (sad, I know!) while writing very little code. So initially I looked into whether I might be able to use or extend EPPlus. As described in this answer, EPPlus does expose the underlying XML, but from what I can figure out, I'd need to:
- add the icon/image data to the package first (the actual visual representation of the file in the worksheet) and make that live in the drawings and/or media folder within the XLSX,
- the drawing data would need to exist in the new format and the legacy (VML) format (unless Spire XLS is just being overly backwards-compat friendly?? Side note: I believe if you use the Office SDK / Excel Interop DLLs you can call for the image information to be generated, but as this is a server based solution I'm looking to avoid that if possible),
- I would need to register relationship IDs for those in various XML files,
- add the attachment as a BIN file (assuming that's just a binary dump?) and create a relationship ID for that,
- and then somehow tie all that together in my worksheet XML...
...headache inducing! Unfortunately I'm not really au-fait with the OpenXML-SDK and I'm not sure how quickly I could pick it up. There's a very real risk I could put a lot of effort in, only to end up with a corrupt / non-compliant file. Unless all of this just seems more complicated than it really is??
The other library that I have used before is NPOI (GitHub repo | nuget Package). It is a .NET port of Apache POI, a Java API for Microsoft documents, and it supports the older Microsoft Office formats as well as the newer ones.
I've seen some SO answers such as this one which indicate it's possible to use POI to embed other MS family documents, but I don't know whether the .NET port (NPOI) fully implements this. I've found very little evidence of people doing this with that particular library. It may just be that this requirement is somewhat rare, so I can't find examples?
Another example of someone solving the embed problem in Java's POI is here - but that appears to be writing in the older office format and using OLE1.0 embeds.
Just posting as I figure it may be possible one of you super helpful guys out there has done exactly this sort of thing before! ;)
Thank you for reading, and sorry if I've been a bit verbose / wasted too much of your time with the wall of text! Any help greatly appreciated!

How to import the data from an Excel spreadsheet so it can be manipulated in C#

I have an Excel spreadsheet (well, hundreds of them) that I need to import into a database.
If the Excel data were in a nice uniform format, I would simply save the files out to CSV, read them in using something like LINQ to CSV, and save the required data.
However, the spreadsheet is 'uneven' in that different groups of cells contain different data.
I need a way of loading the data and then working with cell references to pick out the bits I need and save them to the database.
What's the best way to achieve this?
Thanks
UPDATE: some more information
I have numerous spreadsheets, all identical in structure, that need to be imported into a database. The import is not simple in that different chunks of data from the spreadsheet will go into different tables. The Excel document itself contains a few sections of (basically) question/answer type data. For each section I need to grab the data, shape it into a form that makes sense in terms of the database, and save it.
Ideally I would like to create a quick little WPF app that will let me select a spreadsheet, hit a button, and perform the import.
You could use the Excel Object Model to read the data if you do it in a non-web environment.
See for example How to automate Microsoft Excel from Microsoft Visual C#.NET.
If it has to run inside a web application, I suggest using Aspose.Cells.
Turn the Excel spreadsheet into an ODBC (Open Database Connectivity) data source so you can access it just like you would any database:
http://www.datamystic.com/datapipe/excel_odbc.html
Then access it just like any database using ODBC:
http://msdn.microsoft.com/en-us/library/system.data.odbc.odbcconnection(v=vs.71).aspx
When the data is not uniform, it is often better to keep your approach as simple as possible in the first instance. Start with VBA and the "Range" object (which is part of the Excel object hierarchy). From there you can increase the level of automation and in most instances reuse this "Range" work.
avariable = Range("A2:A5")
That notation is not going to change very much. It won't matter which target language you end up using (C#, VBA, etc.).
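To illustrate why the A1-style notation carries over between languages, here is a minimal C# sketch that turns a reference such as "A2" into the 1-based row and column numbers that many .NET spreadsheet libraries index by. The `CellRef` class and its `Parse` method are hypothetical names made up for this example; they do not come from any library, and absolute markers such as `$A$2` are not handled.

```csharp
using System;

public static class CellRef
{
    // Converts an A1-style reference ("A2", "AB10") into 1-based row and column.
    // Hypothetical helper for illustration only.
    public static void Parse(string reference, out int row, out int column)
    {
        if (string.IsNullOrEmpty(reference))
            throw new ArgumentException("Empty cell reference.", "reference");

        int i = 0;
        column = 0;
        // The leading letters form a base-26 number where A=1 ... Z=26.
        while (i < reference.Length)
        {
            char c = char.ToUpperInvariant(reference[i]);
            if (c < 'A' || c > 'Z') break;
            column = column * 26 + (c - 'A' + 1);
            i++;
        }

        // Whatever follows the letters must be a positive row number.
        if (i == 0 || !int.TryParse(reference.Substring(i), out row) || row < 1)
            throw new FormatException("Not an A1-style reference: " + reference);
    }
}
```

With this helper, "A2" maps to row 2, column 1, and "AB10" maps to row 10, column 28.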
There are a number of other ways of going about this -- java based / xml based / c# based / and a few other really cool ones that only apply to certain niche situations. If you can provide more information about your use case, then perhaps I can suggest some more things to try.
Q & A
example link for automation from C#: http://support.microsoft.com/kb/302084
You should probably take a look at Microsoft's Visual Studio Tools For Office (VSTO), which handles a lot of the unpleasant COM/interop stuff for you.
To those who may be interested I ended up using LinqToExcel:
http://code.google.com/p/linqtoexcel/
Did exactly what I was after with minimal fuss. Excellent

Reading Word Documents stored in Oracle DB as a BLOB object using C#

We store a Word document in an Oracle 10g database as a BLOB object. I want to read the contents (the text) of this Word document, make some changes, and write the text alone to a different field, all from C# code.
How do I do this in C# 2.0?
The easiest logic that I came up with is this -
1. Read the BLOB object
2. Store it in the FileSystem
3. Extract the text contents
4. Do your job
5. Write the text into a separate field.
I can use Word.dll but not any commercial solutions such as Aspose
I assume that you already know how to do steps 1 and 2 (use the Oracle.DataAccess and System.IO namespaces).
For steps 3 and 5, use Word Automation. This MS support article shows you how to get started: How to automate Microsoft Word to create a new document by using Visual C#
If you know what version of Word it will be, then I'd suggest using early binding, otherwise use late binding. More details and sample code here: Using early binding and late binding in Automation
Edit: If you don't know how to use BLOBs from C#, take a look here: How to: Read and Write BLOB Data to a Database Table Through an Anonymous PL/SQL Block
This keeps coming up in my searches, so I'll add an answer for the benefit of future readers.
I highly recommend avoiding Word automation. It's painfully slow and subjects you to the whims of Microsoft's developers with each upgrade. Instead, process the files yourself if you can. The files are nothing but zipped archives of XML files and resources (such as images embedded in the document).
In this case, you'd simply unzip the docx using your preferred library, manipulate the XML, and then zip the result back up.
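As a rough sketch of that unzip / edit / re-zip cycle, the following C# uses only `System.IO.Compression` and `System.Xml.Linq`, which assumes .NET 4.5 or later (not the C# 2.0 mentioned in the question). The `DocxText` class and `ReplaceText` method are names invented for this example. It also assumes the text to replace sits inside a single `<w:t>` run; Word often splits text across several runs, which this sketch does not handle.

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public static class DocxText
{
    // WordprocessingML main namespace used by word/document.xml.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces oldText with newText in every <w:t> element of the main
    // document part. The stream must be readable, writable, and seekable.
    public static void ReplaceText(Stream docx, string oldText, string newText)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Update, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found.");

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part, LoadOptions.PreserveWhitespace);

                foreach (XElement t in xml.Descendants(W + "t"))
                    t.Value = t.Value.Replace(oldText, newText);

                // Truncate the part and write the modified XML back.
                part.Position = 0;
                part.SetLength(0);
                xml.Save(part, SaveOptions.DisableFormatting);
            }
        }
    }
}
```

For a BLOB read from the database, one option is to copy the bytes into an expandable `MemoryStream` (created with `new MemoryStream()` and then written to), call `ReplaceText`, and store `ToArray()` back, so nothing has to touch the file system.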
This does require the use of docx files rather than doc files, but as the link above explains, this has been the default Word format since Office 2007 and shouldn't present an issue unless your users are desperately clinging to the past.
For an example of the time savings: back in 2007 we converted one process that took 45 minutes using Word automation and, on the same hardware, it took 15 SECONDS processing the files manually. To be clear, I'm not blaming Microsoft for this. Their Word automation methods don't know how you will manipulate the document, so they have to anticipate and track everything that you could possibly change. You, on the other hand, can write your method with laser focus because you know exactly what you want to do.

Reading data from an Excel document stored in Sharepoint?

How can I 'read' an Excel 2003 document stored as a SharePoint SPFile? I can retrieve the document from the library with no problems using SPFile.OpenBinary() and then putting the result into a MemoryStream.
The original idea was to use OpenXML to interrogate the document (it accepts this object type in its constructor), but the Excel version (2003) prohibits this.
Just to cloud the issue further, there is no guarantee that I will have any Excel version on the host machine, so possibly won't be able to use the interop assemblies either.
Suggestions or solutions will be gratefully received.
When I say read, I mean pull data from named ranges, cell references, etc. All of the open source libraries I have found (ExcelDataReader, NPOI, OpenXML) have some limitation or another that prohibits their use, e.g. they can't load macro-enabled sheets.
The Excel document is loaded into a SharePoint library, which exposes this list as a collection of SPFile(s). These files can be read into a MemoryStream simply enough, but most of the libraries I have tried require a FileStream in the constructor, which means writing to the filesystem on the application server.
I've not tried SpreadsheetGear, but if there's no footprint on the filesystem, then I'll take a look for sure, but this is not an option on this project. I'll update this thread with my findings...
I'm reduced to using the PIAs. Dirty, dirty, dirty.
SpreadsheetGear for .NET can open xls and xlsx workbooks from a memory stream with SpreadsheetGear.Factory.GetWorkbookSet().Workbooks.OpenFromStream(System.IO.Stream) and can also open directly from a byte array with OpenFromMemory(byte[]). Once a workbook is opened, SpreadsheetGear has a comprehensive API, calculation engine, rendering engine and more.
You can see live samples here and download the free trial here.
Disclaimer: I own SpreadsheetGear LLC
I've found this library (http://exceldatareader.codeplex.com/) on CodePlex, which seems to be able to read any Excel version. There may be many more on the web.
When you say read, what exactly do you mean? There seems to be some debate among developers about how the term is defined. Either way, it shouldn't really matter whether Excel is on the system or not: I'm fairly sure that anyone who wanted to view the file would need at least a reader anyway. That being said, I believe your fear is a moot point and that using a MemoryStream should suffice.

How can I programmatically create, read, write an excel without having office installed?

I'm confused as hell with all the bazillion ways to read/write/create excel files. VSTO, OLEDB, etc, but they all seem to have the requirement that office must be installed.
Here is my situation: I need to develop an app which will take an excel file as input, do some calculations and create a new excel file which will basically be a modification of the first excel file. All with the constraint that the machine that runs this may not have office installed. (Don't ask why...)
I need to support all Excel formats. The only saving grace is that the spreadsheets themselves are really simple. Just a bunch of columns and values, nothing fancy. And unfortunately no CSV, as the end user might not even know what a CSV file is.
Write your Excel file in HTML table format:
<html>
  <body>
    <table>
      <tr>
        <td style="background-color:#acc3ff">Cell1</td>
        <td style="font-weight:bold">Cell2</td>
      </tr>
    </table>
  </body>
</html>
and give your file an xls extension. Excel will convert it automatically.
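A minimal C# sketch of generating such a file from rows of strings might look like the following. The `HtmlXls` class and its methods are hypothetical names for this example, and `WebUtility.HtmlEncode` assumes .NET 4 or later. Note that this only covers writing; it does not help with reading an existing workbook, and newer Excel versions typically warn that the file content does not match the .xls extension before opening it.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

public static class HtmlXls
{
    // Builds an HTML document containing one table, one <tr> per row.
    public static string BuildTable(IEnumerable<IEnumerable<string>> rows)
    {
        var sb = new StringBuilder();
        sb.Append("<html><body><table>");
        foreach (IEnumerable<string> row in rows)
        {
            sb.Append("<tr>");
            foreach (string cell in row)
            {
                // Encode cell text so characters like < and & stay literal.
                sb.Append("<td>").Append(WebUtility.HtmlEncode(cell)).Append("</td>");
            }
            sb.Append("</tr>");
        }
        sb.Append("</table></body></html>");
        return sb.ToString();
    }

    // Writes the table to disk; pass a path ending in .xls.
    public static void Save(string path, IEnumerable<IEnumerable<string>> rows)
    {
        File.WriteAllText(path, BuildTable(rows), Encoding.UTF8);
    }
}
```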
Without Office installed you'll need something designed to understand the Excel binary file format (unless you only want to open Office 2007 .xlsx files).
The best I've found (and that I use) is SpreadsheetGear, which in addition to being .NET native, is much faster and more stable than the COM/OLE solutions (which I've used in the past).
Read and write CSV files instead. Excel reads them just fine and they're easier to use. If you need to work with .xls files, then try supporting OpenOffice as well as Excel: OpenOffice can read and write Excel files.
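If CSV is acceptable, the main thing to get right when writing it is quoting. Below is a small C# sketch of RFC 4180 style escaping; the `Csv` class and its method names are invented for this example. It assumes a comma delimiter, whereas Excel may expect a different list separator depending on the machine's regional settings.

```csharp
using System.Collections.Generic;

public static class Csv
{
    // Quotes a field if it contains a comma, a double quote, or a line break,
    // doubling any embedded quotes.
    public static string Escape(string field)
    {
        if (field == null) return "";
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Joins the escaped fields into one CSV record (without the line ending).
    public static string ToLine(IEnumerable<string> fields)
    {
        var parts = new List<string>();
        foreach (string f in fields) parts.Add(Escape(f));
        return string.Join(",", parts);
    }
}
```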
Did you consider way number bazillion and one: using the Open XML SDK? You can retain styles and tweak it to your liking. Anything you can do in an actual file is possible to achieve programmatically. The SDK comes with a tool called Document Reflector that shows the underlying XML and even shows LINQ statements that can be used to generate it. That is key to playing around with it: see how the changes are made, then recreate them in code.
The only caveat is this will work for the new XML based formats (*.xlsx) not the older versions. There's a slight learning curve but more material is making its way on blogs and other sites.
If cost is not an issue, I'd suggest looking into Aspose's Excel product. I use their Word product and I've been satisfied.
Aspose.Cells
Excel XLSX files are "just" XML files, or more precisely ZIP files containing several XML files. Just rename an Excel file Test.xlsx to Test.zip and open it with your favourite ZIP program. The XML schemas are, afaik, standardized and available. But I think it might not be that easy to manipulate them using only primitive XML processing tools and frameworks.
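The same inspection can be done in code without renaming anything. This is a small sketch using `System.IO.Compression.ZipArchive` (assuming .NET 4.5 or later); `XlsxParts` and `GetPartNames` are hypothetical names for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class XlsxParts
{
    // Returns the names of all parts inside an .xlsx package.
    // The stream is left open for the caller.
    public static List<string> GetPartNames(Stream xlsx)
    {
        var names = new List<string>();
        using (var zip = new ZipArchive(xlsx, ZipArchiveMode.Read, true))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
                names.Add(entry.FullName);
        }
        return names;
    }
}
```

For a typical workbook the list includes entries such as `[Content_Types].xml`, `xl/workbook.xml`, `xl/worksheets/sheet1.xml`, `xl/sharedStrings.xml` and `xl/styles.xml`. Because the method takes a `Stream`, it works equally well with a `FileStream` or a `MemoryStream`.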
Excel files are in a proprietary format so (afaik) you're not going to be able to do this without having the office interop available. Some third party tools exist (which presumably licence the format from MS?) but I've not used them myself to comment on their usefulness.
I assume that you can't control the base file format, i.e. simple CSV or XML formats aren't going to be possible?
I used to use a very nice library called CarlosAg, which uses Excel XML format. It was great (and Excel recognizes the format), and also incredibly fast. Check it out here.
Oh, as a side note, we used to use this for the very same reason you need it. The servers that generated these files were not able to have Excel installed.
If you cannot work with CSV files as per @RHicke's suggestion, I assume you are working on a web app, since a desktop app would be guaranteed to have XL installed as per the requirements.
In that case I'd say: create your processing app as a web service, and build an XL add-in which will interact with your web service directly from XL.
For XLSX files, look at using http://www.codeplex.com/ExcelPackage. Otherwise, some paid 3rd party solutions are out there, like the one David suggested.
I can understand the requirement of not having office installed on a server machine.
There are many libraries available, such as Aspose, though some of them require a license.
If you are targeting MS Excel formats, then a native library from Microsoft, the ACE OLEDB data provider, is available; you can install it on a machine and start reading and writing programmatically. You need to define a connection string and commands as per your needs. (Ref: This article #yoursandmyideas talks about using this library, along with setup and troubleshooting information.)
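As a hedged illustration of what "a connection string and commands" usually look like for this provider, the sketch below only assembles the two strings. The `ExcelOleDb` class and its method names are made up for this example; the provider name and `Extended Properties` values are the commonly used ones for the ACE 12.0 provider, which must be installed separately with a bitness matching your process.

```csharp
using System;

public static class ExcelOleDb
{
    // Builds an ACE OLE DB connection string for an .xlsx or .xls workbook.
    // HDR=YES treats the first row as column names.
    public static string ConnectionString(string path, bool firstRowIsHeader)
    {
        bool isXlsx = path.EndsWith(".xlsx", StringComparison.OrdinalIgnoreCase);
        string format = isXlsx ? "Excel 12.0 Xml" : "Excel 8.0";
        string hdr = firstRowIsHeader ? "YES" : "NO";
        return "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + path +
               ";Extended Properties=\"" + format + ";HDR=" + hdr + "\"";
    }

    // A worksheet is queried like a table named after the sheet plus '$'.
    public static string SelectAll(string sheetName)
    {
        return "SELECT * FROM [" + sheetName + "$]";
    }
}
```

On Windows these strings would typically be passed to `System.Data.OleDb.OleDbConnection` and an `OleDbDataAdapter` to fill a `DataTable`. As noted elsewhere on this page, this route only suits data laid out as a plain table.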
