I have a DataTable with 3 columns (a,b,c) and a docx file with the corresponding MailMerge fields set up. What I'd like to do is perform a Mail Merge on the document with the data.
Presume that you can write to the hard disk (if you need to create a CSV etc. for the merge), that Word and Excel are not installed, and that the Open XML SDK is available, though equally we can install anything else.
In terms of the answer, converting the input data to whatever is needed isn't really a problem; the problem is how to perform a mail merge with the Open XML SDK (or another free API).
As a side note, the output should be one file with n pages (where n is the number of rows in the data), i.e. not n documents (though I don't mind if the merge of documents is done at the end).
(I should add, I'm not tied to the MailMerge concept, being able to just do a replace for example would work - though obviously that then requires merging the files together at the end...)
I've got this working, in a pretty horrible way, at the moment. The algorithm is roughly this:
Unzip docx file
Read in document.xml (to a string)
string.Replace the fields
Rezip to temporary docx
Merge all the temporary documents created
The actual code for merging the documents comes from Eric White's blog: http://blogs.msdn.com/ericwhite/archive/2009/02/05/move-insert-delete-paragraphs-in-word-processing-documents-using-the-open-xml-sdk.aspx
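For reference, the per-row replace step looks roughly like this when done through the Open XML SDK rather than by unzipping manually. This is only a minimal sketch: the «a»/«b»/«c» placeholder tokens and the method name are illustrative assumptions, not taken from my actual template.

using System.Data;
using System.IO;
using DocumentFormat.OpenXml.Packaging;

static void FillTemplate(string templatePath, string outputPath, DataRow row)
{
    // Work on a copy so the template itself stays untouched.
    File.Copy(templatePath, outputPath, true);

    using (WordprocessingDocument doc = WordprocessingDocument.Open(outputPath, true))
    {
        string xml;
        using (StreamReader reader = new StreamReader(doc.MainDocumentPart.GetStream()))
            xml = reader.ReadToEnd();

        // Naive text replacement of the placeholder tokens for this row.
        xml = xml.Replace("«a»", row["a"].ToString())
                 .Replace("«b»", row["b"].ToString())
                 .Replace("«c»", row["c"].ToString());

        using (StreamWriter writer = new StreamWriter(doc.MainDocumentPart.GetStream(FileMode.Create)))
            writer.Write(xml);
    }
}

Each per-row copy produced this way is then merged into the final document using the approach from the blog post above.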
I have been using E-IceBlue's Spire.XLS library (License Purchase Page | nuget Package), and while it is excellent I've hit a couple of hurdles.
The gist of my requirement is this:
I have to take a bunch of data from our intranet CMS, along with attachments users have uploaded to it, and periodically email that information to a third party outside the company. We were originally sending the data and the user-uploaded attachments separately, but as the documents became more numerous and unwieldy I was asked to try and combine everything into one file. The attachments were small enough to embed, so I achieved this by creating an Excel report using Spire.XLS, which allows me not only to add OleObjects to the package, but to position (anchor) them at a specific row or column as well, maintaining a nice visual link with the data from the CMS record. As such I can have all my data on a row in columns A through AB, for example, and the attachments start appearing right at the end of the row in columns AC, AD and so on.
In terms of how I implemented that: I grab my data from the CMS, iterate through each item (which includes the attachment/file data), get the default image/icon for the relevant file type, create an OleObject on the worksheet and then position it, something a bit like this:
MyAttachmentCollection attachments = GetAttachments(itemId);

foreach (File attachment in attachments)
{
    string fileType;
    string localFilePath;

    // Use WebClient to download the file locally...
    /* --- pseudo-code omitted for brevity --- */

    // Embed the attachment and anchor it to the next cell along the current row,
    // using the file-type icon as its visual representation.
    worksheet.OleObjects.Add(localFilePath, image, OleLinkType.Embed);
    worksheet.OleObjects.Last().Location = worksheet.Range[row, col + 1];
    worksheet.OleObjects.Last().ObjectType = fileType;
    col++;
}
Nice and simple, and the result is pretty good. Sadly, its success means that the powers that be want to send more and more data this way, without ponying up the cash for a Spire.XLS license. The free license only allows 200 rows of data or 5 worksheet tabs. This is a single-use case for us, so I think they're finding it hard to justify the license cost for this one development and its future upkeep. We're public services too, so budget-wise we have to try and do things on the cheap!
I'm aware that XLSX / Open XML spreadsheet documents are basically zipped/packaged storage containers, so I've taken a look at the contents of an Excel file that contains some attachments added in this way, and I've tried to understand the various schemas and how I might replicate the effect. But I'm struggling to wrap my head around it, to be honest, and I'm wondering whether any other libraries exist that already do this sort of legwork?
One of the things I love about EPPlus (Old Codeplex Page | nuget Package) is being able to take a DataSet or DataTable and insert it directly into a worksheet at a given cell reference. I also like that I can use built-in Excel styles or define my own and apply those. I can create really lovely looking spreadsheets (sad I know!) while writing very little code. So initially, I looked into whether I might be able to use or extend EPPlus... And as described in this answer, EPPlus does expose the underlying XML, but from what I can figure, I'd need to:
add the icon/image data to the package first (the actual visual representation of the file in the worksheet) and make that live in the drawings and/or media folder within the XLSX,
the drawing data would need to exist in both the new format and the legacy (VML) format (unless Spire.XLS is just being overly backwards-compatibility friendly? Side note: I believe if you use the Office SDK / Excel Interop DLLs you can ask for the image information to be generated, but as this is a server-based solution I'm looking to avoid that if possible),
I would need to register relationship IDs for those in various XML files,
add the attachment as a BIN file (assuming that's just a binary dump?) and create a relationship ID for that,
and then somehow tie all that together in my worksheet XML...
...headache inducing! Unfortunately I'm not really au fait with the OpenXML SDK and I'm not sure how quickly I could pick it up. There's a very real risk I could put a lot of effort in, only to end up with a corrupt / non-compliant file. Or does all of this just seem more complicated than it really is?
The other library that I have used before is NPOI (GitHub repo | nuget Package). This is based on POI, a Java API for Microsoft documents, and it supports the older Microsoft Office formats as well as the newer ones.
I've seen some SO answers such as this one which indicate it's possible to use POI to embed other MS family documents, but I don't know if the .NET fork (NPOI) fully implements this. I've found very little evidence of people doing this using that particular library... it may just be that this requirement is somewhat rare, so I can't find examples?
Another example of someone solving the embed problem in Java's POI is here, but that appears to write the older Office format and to use OLE 1.0 embeds.
Just posting as I figure it may be possible one of you super helpful guys out there has done exactly this sort of thing before! ;)
Thank you for reading, and sorry if I've been a bit verbose / wasted too much of your time with the wall of text! Any help greatly appreciated!
I am calling a web service and the data from the web service is in csv format.
If I try to save data in xls/xlsx, then I get multiple sheets in a workbook.
So, how can I save the data as CSV with multiple tabs/sheets in C#?
I know CSV with multiple tabs is not practical, but is there any damn way, or any library, to save data in CSV with multiple tabs/sheets?
CSV, as a file format, assumes one "table" of data; in Excel terms that's one sheet of a workbook. While it's just plain text, and you can interpret it any way you want, the "standard" CSV format does not support what your supervisor is thinking.
You can fudge what you want a couple of ways:
Use a different file for each sheet, with related but distinct names, like "Book1_Sheet1", "Book1_Sheet2" etc. You can then find groups of related files by the text before the first underscore. This is the easiest to implement, but requires users to schlep around multiple files per logical "workbook", and if one gets lost in the shuffle you've lost that data.
Do the above, and also "zip" the files into a single archive you can move around (see the sketch after this list). You keep the pure-CSV advantage of the above option, plus the convenience of having one file to move instead of several, with the downside of having to zip/unzip the archive to get to the actual files. To ease the pain, if you're on .NET 4.5 you have access to a built-in ZipFile implementation, and if you are not you can use the open-source DotNetZip or SharpZipLib, any of which will allow you to programmatically create and consume standard Windows ZIP files. You can also use the nearly universal .tar.gz (aka .tgz) combination, but your users will need either your program or a third-party compression tool like 7-Zip or WinRAR to create the archive from a set of exported CSVs.
Implement a quasi-CSV format where a blank line (containing only a newline) acts as a "tab separator", and your parser expects a new line of column headers followed by data rows in the new configuration. This variant of standard CSV may not be readable by other consumers of CSVs as it doesn't adhere to the expected file format, so I would recommend you don't use the ".csv" extension; it will confuse and frustrate users expecting to be able to open it in other applications such as spreadsheets.
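For the zip option, the .NET 4.5 route is only a couple of lines. A minimal sketch, assuming the per-sheet CSVs have already been written to a folder (the folder and file names here are illustrative):

using System.IO.Compression;   // also reference System.IO.Compression.FileSystem

static void PackageCsvFiles(string csvFolder, string zipPath)
{
    // Zips every per-sheet CSV (e.g. Book1_Sheet1.csv, Book1_Sheet2.csv)
    // in the folder into a single archive the user can move around.
    ZipFile.CreateFromDirectory(csvFolder, zipPath);
}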
If I try to save data in xls/xlsx, then I get multiple sheets in a workbook.
Your answer is in your question: don't use text/csv, which most certainly cannot do multiple sheets (it can't even do one sheet; there's no such thing as a sheet in text/csv, only in how applications like Excel or Calc choose to import it into a format that does have sheets). Save it as xls, xlsx, ods or another format that does have sheets.
Both XLSX and ODS are much more complicated than text/csv, but are each probably the most straightforward of their respective sets of formats.
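If a free library is acceptable, EPPlus (mentioned in an earlier question on this page) makes the multi-sheet workbook straightforward. A minimal sketch, assuming each CSV string returned by the web service becomes one worksheet; the method, sheet and file names are illustrative:

using System.IO;
using OfficeOpenXml;

static void SaveCsvBlocksAsWorkbook(string[] csvBlocks, string outputPath)
{
    using (ExcelPackage package = new ExcelPackage())
    {
        for (int i = 0; i < csvBlocks.Length; i++)
        {
            ExcelWorksheet sheet = package.Workbook.Worksheets.Add("Sheet" + (i + 1));
            // LoadFromText parses comma-separated text straight into the sheet.
            sheet.Cells["A1"].LoadFromText(csvBlocks[i]);
        }

        package.SaveAs(new FileInfo(outputPath));
    }
}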
I've been using this library for a while now in my projects:
https://github.com/SheetJS/js-xlsx
I use it to import data and structure from formats like xls(x), csv and xml, but you can certainly save to those formats as well (all from the client)!
Hope that can help you. Take a look at the online demo:
http://oss.sheetjs.com/js-xlsx/
Peek in the source code or file an issue on GitHub, but I think you will have to do most of the coding on your own.
I think you want to reduce the size of your Excel file. If so, you can do it by saving it as xlsb, i.e. the Excel Binary Workbook format. You can reduce the file size further by deleting all the blank cells.
I have code that uses the OpenXML library to export data.
I have 20,000 rows and 22 columns and it takes ages (about 10 minutes).
Is there any solution that would export data from C# to Excel faster? I am doing this from an ASP.NET MVC app and many people's browsers are timing out.
Assuming 20,000 rows and 22 columns of about 100 bytes each makes roughly 42 MB for the data alone. Plus XML tags, plus formatting, I'd say you end up zipping (.xlsx is nothing but several zipped XML files) around 100 MB of data.
Of course this takes a while, and so does fetching the data.
I recommend you use ExcelPackage Plus (EPPlus) instead of the Open XML SDK.
http://epplus.codeplex.com/
There's probably a bug/performance issue in the write-in-a-hurry-and-hope-it-doesn't-blow-up-too-soon Microsoft code.
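As a rough sketch of what the EPPlus route looks like for an export of this size, assuming the data is already in a DataTable (the sheet name and table style are illustrative):

using System.Data;
using OfficeOpenXml;
using OfficeOpenXml.Table;

static byte[] ExportToXlsx(DataTable data)
{
    using (ExcelPackage package = new ExcelPackage())
    {
        ExcelWorksheet sheet = package.Workbook.Worksheets.Add("Export");

        // LoadFromDataTable writes the whole table in one call, which is far
        // faster than emitting the cells one element at a time.
        sheet.Cells["A1"].LoadFromDataTable(data, true, TableStyles.Medium2);

        // Return the bytes so the MVC action can stream them to the browser.
        return package.GetAsByteArray();
    }
}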
CSV. It is a plain text file, but can be opened by any version of Excel.
No doubt it is an easier way to export data to Excel; a lot of websites provide data export as CSV.
All you need to do is add a comma (,) to separate the values and a line break to separate the records. It doesn't take extra resources to build the CSV file, so it is quite fast.
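For example, a minimal sketch of writing a DataTable out as CSV (note that values containing commas, quotes or newlines would need quoting, which is omitted here for brevity):

using System.Data;
using System.IO;
using System.Linq;

static void WriteCsv(DataTable table, string path)
{
    using (StreamWriter writer = new StreamWriter(path))
    {
        // Header row: column names separated by commas.
        writer.WriteLine(string.Join(",", table.Columns
            .Cast<DataColumn>().Select(c => c.ColumnName)));

        // One line per record, values separated by commas.
        foreach (DataRow row in table.Rows)
            writer.WriteLine(string.Join(",", row.ItemArray));
    }
}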
I wound up using an open-source solution called ClosedXML, which worked great.
Depending on what version of Excel you are targeting, you could expose the data as an OData service, which Excel 2010 can natively consume and which will handle the downloading and formatting for you.
I am assuming that this data is something that needs to be completely sent to the client and has already been pre-filtered in some fashion, but still needs to be sent back to the person who made the request.
In this case, you want to perform this particular operation 'asynchronously'. I'm not sure if this would fit your workflow, but say a person requests this large XML-formatted document; I would: a) queue a worker thread to kick off generating the document while returning a 'token' (perhaps a GUID) to the requester; b) return a link to a page where the requester can click on the link (passing the token), allowing the page to look up the results.
When the thread has completed processing the document, it places it into a special folder with a unique name and adds the token to a database table along with the document's location. If the person requests that page and the token exists in the database and the document exists on the file system, they are allowed to click and download it over HTTP. If it does not exist, they are either told it does not exist or asked to wait for the results. (This message can be based on the time the request was received.)
If the person downloads the document successfully (and you can detect this through script), you can remove the entry from the database for the document with that token and delete the file from the file system.
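A minimal sketch of that pattern, using an in-memory dictionary where the answer suggests a database table and the temp folder in place of the "special folder"; the class and method names are illustrative assumptions:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class ExportQueue
{
    // token -> path of the finished document (present only once generation is done);
    // the answer suggests a database table here instead of an in-memory map.
    private static readonly ConcurrentDictionary<Guid, string> Completed =
        new ConcurrentDictionary<Guid, string>();

    public static Guid Enqueue(Action<string> generateDocument)
    {
        Guid token = Guid.NewGuid();
        string path = Path.Combine(Path.GetTempPath(), token + ".xlsx");

        // Generate on a background thread and record the result when finished.
        Task.Run(() =>
        {
            generateDocument(path);
            Completed[token] = path;
        });

        return token;   // hand this back to the requester immediately
    }

    // The results page calls this; null means "still processing".
    public static string TryGetResult(Guid token)
    {
        string path;
        return Completed.TryGetValue(token, out path) ? path : null;
    }
}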
I hope I read this question correctly.
I have found that I can speed up exporting data from a database into an Excel spreadsheet by limiting the number of write operations. By accumulating 100 lines of data before writing, the creation speed increased by a factor of at least 5-10x.
The mistake most often made when exporting data is in the workflow:
Build Model
Build XML DOM
Save XML DOM to file
This workflow leads to overhead, because building up the XML DOM takes time, the XML DOM is kept in memory together with the model, and then the whole bunch of data is written to a file.
A better way to handle this is to convert your model entry by entry directly to the target format and write it directly to a (buffered) file.
A format with low overhead that's fast to write and is readable by Excel is CSV (ok, it's legacy, it's awkward...).
We store a word document in an Oracle 10g database as a BLOB object. I want to read the contents (the text) of this word document, make some changes, and write the text alone to a different field in a C# code.
How do I do this in C# 2.0?
The easiest logic that I came up with is this:
Read the BLOB object
Store it in the FileSystem
Extract the text contents
Do your job
Write the text into a separate field.
I can use Word.dll, but not any commercial solutions such as Aspose.
I assume that you already know how to do steps 1 and 2 (use the Oracle.DataAccess and System.IO namespaces).
For step 3 and 5, use Word Automation. This MS support article shows you how to get started: How to automate Microsoft Word to create a new document by using Visual C#
If you know what version of Word it will be, then I'd suggest using early binding, otherwise use late binding. More details and sample code here: Using early binding and late binding in Automation
Edit: If you don't know how to use BLOBs from C#, take a look here: How to: Read and Write BLOB Data to a Database Table Through an Anonymous PL/SQL Block
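For steps 1 and 2, a minimal sketch using ODP.NET (Oracle.DataAccess); the connection string, table and column names are illustrative assumptions:

using System.IO;
using Oracle.DataAccess.Client;

static string SaveBlobToFile(string connectionString, int documentId)
{
    string tempPath = Path.Combine(Path.GetTempPath(), documentId + ".doc");

    using (OracleConnection conn = new OracleConnection(connectionString))
    using (OracleCommand cmd = new OracleCommand(
        "SELECT doc_blob FROM documents WHERE doc_id = :id", conn))
    {
        cmd.Parameters.Add("id", OracleDbType.Int32).Value = documentId;
        conn.Open();

        // ODP.NET returns the BLOB column as a byte array here.
        byte[] data = (byte[])cmd.ExecuteScalar();
        File.WriteAllBytes(tempPath, data);
    }

    return tempPath;
}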
This keeps coming up in my searches, so I'll add an answer for the benefit of future readers.
I highly recommend avoiding Word automation. It's painfully slow and subjects you to the whims of Microsoft's developers with each upgrade. Instead, process the files manually yourselves if you can. The files are nothing but zipped archives of XML files and resources (such as images embedded in the document).
In this case, you'd simply unzip the docx using your preferred library, manipulate the XML, and then zip the result back up.
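A minimal sketch of that unzip/edit/rezip round trip, assuming .NET 4.5's System.IO.Compression is available; the placeholder text and method name are illustrative:

using System.IO;
using System.IO.Compression;   // plus a reference to System.IO.Compression.FileSystem

static void ReplaceInDocx(string docxPath, string oldText, string newText)
{
    using (ZipArchive archive = ZipFile.Open(docxPath, ZipArchiveMode.Update))
    {
        // The main body of a .docx lives in word/document.xml inside the package.
        ZipArchiveEntry entry = archive.GetEntry("word/document.xml");

        string xml;
        using (StreamReader reader = new StreamReader(entry.Open()))
            xml = reader.ReadToEnd();

        xml = xml.Replace(oldText, newText);

        // Delete and recreate the entry so the old content is fully truncated.
        entry.Delete();
        ZipArchiveEntry newEntry = archive.CreateEntry("word/document.xml");
        using (StreamWriter writer = new StreamWriter(newEntry.Open()))
            writer.Write(xml);
    }
}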
This does require the use of docx files rather than doc files, but as the link above explains, this has been the default Word format since Office 2007 and shouldn't present an issue unless your users are desperately clinging to the past.
For an example of the time savings: back in 2007 we converted one process that took 45 minutes using Word automation and, on the same hardware, it took 15 seconds processing the files manually. To be clear, I'm not blaming Microsoft for this. Their Word automation methods don't know how you will manipulate the document, so they have to anticipate and track everything you could possibly change. You, on the other hand, can write your method with laser focus, because you know exactly what you want to do.
I have to load a few hundred documents of various sorts from disk and produce one giant aggregated PDF.
My approach is to convert each document to PDF individually, then aggregate them.
I have been writing each PDF document to isolated storage as part of this process.
Is there a better way to do this?
Performance is a priority - this is client side.
Have you heard of Pdftk (free)? http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/ - I use it to merge multiple PDFs into one.
In combination with BullZip PDF printer (free) - http://www.bullzip.com/products/pdf/info.php - I can basically create any PDFs I need.
I'm not sure if you can do any automation with BullZip, but with Pdftk you could certainly work with it through C# and System.Diagnostics.Process.
Here's a typical command line that I use:
pdftk folder1/file1.pdf folder1/file2.pdf folder2/file1.pdf cat output all_files.pdf
That merges the three files, in order, into all_files.pdf.
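A minimal sketch of driving it from C# via System.Diagnostics.Process (pdftk is assumed to be on the PATH, and paths containing spaces would need quoting):

using System.Diagnostics;

static void MergePdfs(string[] inputFiles, string outputFile)
{
    ProcessStartInfo psi = new ProcessStartInfo();
    psi.FileName = "pdftk";
    psi.Arguments = string.Join(" ", inputFiles) + " cat output " + outputFile;
    psi.UseShellExecute = false;
    psi.CreateNoWindow = true;

    using (Process process = Process.Start(psi))
    {
        // Block until pdftk has finished writing the merged PDF.
        process.WaitForExit();
    }
}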