Merge documents

Merge documents - C#

I'm trying to merge two docx documents into one docx document using OpenXML SDK 2.0. The documents should be merged without losing their styling or their custom headers and footers. I hoped I could achieve this using AltChunk and a section break, but I can't get it working.
Is what I'm trying to do possible? Can someone give me a hint on how to achieve this?

The above answer is NOT correct at all! This is EXACTLY what AltChunk has been designed to do, and it works great!
NOTE: the documents are not actually merged into one document until Word opens the file for the first time (and the file then has to be saved, or the file on disk won't be updated).
See this blog for more information on how to do it properly:
https://blogs.msdn.com/b/ericwhite/archive/2008/10/27/how-to-use-altchunk-for-document-assembly.aspx?Redirected=true
p.s. As for examining Open XML using the productivity tool, my opinion is to just install the official Visual Studio Open XML add-on and open the Office Documents from Visual Studio to examine them, it's super convenient! :-)
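
For reference, a minimal sketch of the AltChunk approach with the Open XML SDK. The class name, method name and chunk id are made up for illustration, and the code assumes the target document already has a main document part with a body. The relationship id must be unique within the target document.

```csharp
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class AltChunkMerger
{
    // Embeds the file at sourcePath as an altChunk at the end of the body of targetPath.
    // chunkId is a hypothetical relationship id such as "AltChunkId1".
    public static void AppendAsAltChunk(string targetPath, string sourcePath, string chunkId)
    {
        using (WordprocessingDocument target = WordprocessingDocument.Open(targetPath, true))
        {
            MainDocumentPart mainPart = target.MainDocumentPart;

            // Store the whole source document as an alternative format import part.
            AlternativeFormatImportPart chunkPart = mainPart.AddAlternativeFormatImportPart(
                AlternativeFormatImportPartType.WordprocessingML, chunkId);
            using (FileStream source = File.OpenRead(sourcePath))
            {
                chunkPart.FeedData(source);
            }

            // Reference the embedded part from the body.
            Body body = mainPart.Document.Body;
            AltChunk altChunk = new AltChunk { Id = chunkId };

            // Keep the body-level section properties as the last child of the body.
            SectionProperties finalSection = body.Elements<SectionProperties>().LastOrDefault();
            if (finalSection != null)
            {
                body.InsertBefore(altChunk, finalSection);
            }
            else
            {
                body.AppendChild(altChunk);
            }

            mainPart.Document.Save();
        }
    }
}
```

A call would look like AltChunkMerger.AppendAsAltChunk("doc1.docx", "doc2.docx", "AltChunkId1"). As noted above, the result only contains a reference to the embedded file until Word opens and saves it.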

Using the 'Open XML Productivity Tool' I analyzed the structure of a docx document, and concluded that merging documents with their styles, headers, footers, and so on is not possible out of the box using AltChunk. The tool can be downloaded separately from the Open XML SDK.
What I'm doing now, and what is working, is copying everything manually into the document, making sure that all style references, header references, footer references, and so on are preserved. This means that I give them a new unique id before I copy them into the document, and change all references from the old id to the new id. It takes a lot of code to do this, but the tool mentioned above really helped.
Adding a section break is also quite difficult. You should know that the SectionProperties tag describes all the properties of a section, and that there can be one SectionProperties tag directly under the Body tag, describing the properties of the last section. So adding a new section break means copying the last SectionProperties tag into the last paragraph of the section and adding a new SectionProperties tag under the Body tag. I also got a lot of information about this from the productivity tool.
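
The section break step can be sketched with LINQ to XML on the body element of word/document.xml. This is a simplified variant, not the exact code described above: instead of reusing the last existing paragraph, it adds a new empty paragraph that carries a copy of the body-level sectPr, and it leaves the original body-level sectPr in place to describe the section that follows. The class and method names are made up for illustration.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class SectionBreaks
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Ends the current last section of <w:body>. A copy of the body-level <w:sectPr>
    // is placed in the paragraph properties of a new empty paragraph, which marks the
    // end of that section. The body-level <w:sectPr> stays last and now belongs to
    // whatever content is inserted after the new paragraph.
    public static void EndCurrentSection(XElement body)
    {
        XElement bodySectPr = body.Elements(W + "sectPr").LastOrDefault();
        if (bodySectPr == null)
        {
            // No explicit section properties yet: add an empty element so the
            // structure is the same in both cases.
            bodySectPr = new XElement(W + "sectPr");
            body.Add(bodySectPr);
        }

        XElement breakParagraph = new XElement(W + "p",
            new XElement(W + "pPr",
                new XElement(bodySectPr))); // the XElement copy constructor makes a deep clone

        bodySectPr.AddBeforeSelf(breakParagraph);
    }
}
```

After the call, content copied from the second document would go between the new paragraph and the body-level sectPr, and that sectPr can then be replaced with the one from the second document. Header and footer references inside either sectPr still need the id remapping described above.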

Related

Headers lose styles after merging documents with Xceed Docx

Let me explain my scenario.
I am making use of Xceed Docx library to merge and manipulate word documents.
I have multiple templates that need to be merged to form one customer-facing document.
All of them have individual document headers, tables and images.
As per business requirements, we need to make use of content controls as there will be manual intervention.
PROBLEM:
All goes well and the merge works as expected, but it seems to drop the styling of the headers in the merged document. This only occurs when I include CONTENT CONTROLS (rich text content controls)!
For example: Header 1 and Header 2 become normal text.
Has anyone experienced anything similar with this library?
Is there something I am doing wrong or missing?

I did try to contact the developers of DocX, to no avail.
I tried merging the files with OpenXml using AltChunk.
This did work, but not to the extent that I required.
Let me explain.
AltChunk inserts the entire file (doc2.docx) into the base file (doc1.docx) and then only adds a reference to doc2 inside doc1's XML file.
Hope that makes sense.
MS Word can open this file, but when I want to make changes using DocX it is unable to load the file.
I ended up using DocX for all the document manipulation and OpenXmlPowerTools to merge the documents.
OpenXmlPowerTools seems to resolve the above-mentioned issue, as it does seem to do a complete image, chart and text merge.
I hope this helps someone in the near future ;-P
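
A minimal sketch of that merge using DocumentBuilder from OpenXmlPowerTools. The class name, method name and file paths are placeholders. Passing true for keepSections is an assumption about what suits this scenario: it asks DocumentBuilder to keep each source's sections, and with them their headers and footers.

```csharp
using System.Collections.Generic;
using OpenXmlPowerTools;

public static class PowerToolsMerger
{
    // Builds outputPath from the two source documents, in order.
    public static void Merge(string firstPath, string secondPath, string outputPath)
    {
        var sources = new List<Source>
        {
            // Second argument is keepSections.
            new Source(new WmlDocument(firstPath), true),
            new Source(new WmlDocument(secondPath), true),
        };

        DocumentBuilder.BuildDocument(sources, outputPath);
    }
}
```

The result is a regular docx with the content physically merged, so it can be loaded again by DocX or any other library, unlike the AltChunk output before Word has processed it.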

Navigate By Columns using Aspose.words in C#

I am evaluating Aspose.Words for one of my clients, and I have migrated almost all of the features from the MS Word library to Aspose.Words. There is just one more to go, but I am struggling to find a solution for the following:
We have a template document in .docx format. The template has a two-column page layout. At run time the system copies content from another document into this template document. Up to this step everything works fine.
When I open the template page, it looks good with the two-column layout.
But we have some logic that should read the last line of the first column and check whether the text is in a specific format; if it is, the logic moves one line down, which automatically moves to the next column.
This logic is easily achievable in Word, but I couldn't find any reference in Aspose.Words to implement it.
I also tried a different approach by converting the document to XML, and found that there is one node called . But this node is visible only when I save the document as XML from Microsoft Word. It does not occur if I save the document as XML from Aspose.Words.
Please advise me on how to solve this issue.
Thanks in advance
Gunasekara S

We have just finished integrating a feature into Aspose.Words that opens up access to the rendering engine, so that each element of the rendered document can be read as it appears in pages, columns, lines, spans, etc. This functionality is exactly what you need and will be available in the next version of Aspose.Words, which is expected to be released in about a week's time. I will share a code snippet to accomplish your requirement soon.
My name is Nayyer and I am a developer evangelist at Aspose.

How to read metadata information from docx documents?

What I need to achieve is to have a Word document template (docx), which will contain Title, Author name, Date, etc.
This template will then be completed by users. I need to create a C# program that will take in the docx file and read all the information of interest (title, name, date, ...).
So my questions are:
How do I put the metadata into the template, saying: this is Title, this is Date, this is Name, etc.? (not programmatically)
How do I programmatically read that information?

One way to approach this would be to use Content Controls. In Office, you can create your template, and then for each of your respective inputs of interest you can place one of these controls. They're under the Developer tab in Office.
After inserting your controls you'll need for each of them to have a unique name. Office will let them all have the same name, but you'll need to uniquely identify all of them in your template document.
You now need to get the data that's entered into these controls. Again, there are likely to be some better solutions, but Eric White has all kinds of great OpenXML stuff, and so here's one of his: Iterating over Content Controls
I think there are problems with finding content controls nested within a table. So, if you do that, then I think you have to specifically loop over the elements of the table to find the content controls within.
Also, you're probably going to want to save a .docx from your .dotx file, and I don't think there's any built-in "one-liner" method for that in OpenXML; however, you can create a new Word document, and then write the file stream of the template into the newly created docx file. Again, of course, there may be better solutions out there.
Have you been here? There's lots of good stuff:
Introduction to OpenXML
Additionally, Eric has been releasing more and more videos on the OpenXML YouTube channel
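
A minimal sketch of reading the content controls without the SDK, using only System.IO.Compression and LINQ to XML. It assumes the main part is stored at word/document.xml (the usual location, though the package format does not guarantee it) and that each control has been given a unique Tag in Word. The class and method names are made up. Because it walks all descendants, controls nested inside tables are found as well.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class ContentControlReader
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns a map from content control tag to the text inside the control.
    // Controls without a tag are skipped; text from several runs or paragraphs
    // is concatenated without separators.
    public static Dictionary<string, string> ReadByTag(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found in the package.");
            }

            XDocument document;
            using (Stream part = entry.Open())
            {
                document = XDocument.Load(part);
            }

            var values = new Dictionary<string, string>();
            foreach (XElement sdt in document.Descendants(W + "sdt"))
            {
                string tag = (string)sdt.Element(W + "sdtPr")?.Element(W + "tag")?.Attribute(W + "val");
                if (string.IsNullOrEmpty(tag))
                {
                    continue;
                }

                XElement content = sdt.Element(W + "sdtContent");
                values[tag] = content == null
                    ? string.Empty
                    : string.Concat(content.Descendants(W + "t").Select(t => t.Value));
            }

            return values;
        }
    }
}
```

With a file on disk this could be called as ContentControlReader.ReadByTag(File.OpenRead(path)), and the title would then be available as values["Title"] if the control was tagged "Title".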

1) how do I put the metadata into the template saying: this is Title,
this is Date, this is Name, etc? (not programatically)
You could do that on the Info tab in MS Word 2010.
2) how do I programmatically read that information?
Once you have created your document (or template), you can always look inside it with the Open XML SDK 2.0 Productivity Tool (which is installed with the Open XML SDK) to see where (and with what classes) to get or set information in the document.
Also I think this post might help you to solve your task:
Add and update custom document properties in a docx
UPDATE:
Hi Dave,
Please have a look at this MSDN Article - Retrieving Application Properties from Word 2010 Documents by Using the Open XML SDK 2.0
Hope this is exactly what you are looking for.

All OpenXML documents have built-in core metadata that will do what you need through System.IO.Packaging. Once you open the Word file using the Open XML SDK in C#, you can get to these values via the PackageProperties class, which exposes the standard core properties such as Title, Creator, Subject and Keywords.
You "encourage" your users to enter the metadata using Word's Document Information Panel (DIP).
You can force this on by default when they open your template, through a setting in the Developer toolbar for the template. See the following article on how to set this in your template.
I wrote a quick Windows Forms app that displays this information using an Open XML SDK call to the PackageProperties of the Word file.
Here is the full solution with the sample word file included.
Hope this helps.
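
A minimal sketch of reading those core properties directly through System.IO.Packaging, without the SDK. The class and method names are made up. On .NET Framework this needs a reference to WindowsBase; on newer .NET versions it needs the System.IO.Packaging package.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;

public static class CoreMetadata
{
    // Reads a few of the core properties of a docx (or any other OPC package).
    // Properties that were never set come back as null.
    public static Dictionary<string, string> Read(string docxPath)
    {
        using (Package package = Package.Open(docxPath, FileMode.Open, FileAccess.Read))
        {
            PackageProperties props = package.PackageProperties;
            return new Dictionary<string, string>
            {
                ["Title"] = props.Title,
                ["Author"] = props.Creator,
                ["Subject"] = props.Subject,
                ["Keywords"] = props.Keywords,
                // Modified is a nullable DateTime; "o" is the round-trip format.
                ["Modified"] = props.Modified?.ToString("o"),
            };
        }
    }
}
```

The same PackageProperties object is also where the values typed into the Document Information Panel end up, so the template only has to make sure users fill that panel in.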

Batch conversion of docx to clean HTML

I'm starting to wonder if this is even possible. I've searched for solutions on Google and come up with nothing that works exactly how I'd like it to.
I think it'd help to explain what that entails. I work for the database group at my university's IT department. My main job is to take the specs of a report in a docx file, copy that over to Dreamweaver, fix some formatting, and put it onto their website. My issue is that it's ridiculously tedious to do this over and over. I figured, hey, I haven't written anything in C# for some time now, perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, stick the header and footer from the webpage on there, and save the result. I originally planned to have it do files one by one, but it probably wouldn't be difficult to have it take a list of files and batch convert them.
I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.
http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190
This is probably fine for a few documents, but since it's just automating an instance of Word, I feel like it'd be slow and memory intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.
http://openxmldeveloper.org/articles/333.aspx
This is what I started using. XSLT had the benefit of not needing Word to be installed or run for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs the document.xml from that, and uses the DocX2Html.xsl file I scavenged from OpenXML Viewer. I believe that was originally provided by MS for SharePoint servers to provide the ability to render Word documents in a browser. Or something along those lines.
After adjusting that code to fit my needs, and having issues with the objXSLT.Load() method, I ended up using ILMerge to make the XSL into a DLL. No idea why I kept getting a compile error when using the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It does the job of converting docx to HTML just fine (other than random spaces between some words), but the resulting file has ridiculously ugly HTML syntax. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).
Does anyone know how I could remedy this? I'm thinking perhaps I need to make a new XSL file, as the one MS provided is what's responsible for sticking all those tags and extra code in there. My issue with that is that I don't know anything about how to do that. Perhaps there's an alternative version already out there. All I'd need is one that will preserve tables and text formatting. Images aren't needed.

This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx
The author Eric White blogged about his experiences developing that tool. You can see that list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml
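
A rough sketch of a batch conversion using the HtmlConverter class that ships with that tool (OpenXmlPowerTools). The class name, method names and folder layout here are placeholders, and PageTitle is the only setting used. The document is opened from a writable in-memory copy, following the pattern of the PowerTools samples, so the file on disk is never touched.

```csharp
using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

public static class HtmlBatchConverter
{
    // Converts one docx file to an HTML file.
    public static void ConvertFile(string docxPath, string htmlPath)
    {
        byte[] bytes = File.ReadAllBytes(docxPath);
        using (var memory = new MemoryStream())
        {
            // Copy into an expandable stream so the document can be opened for editing.
            memory.Write(bytes, 0, bytes.Length);
            using (WordprocessingDocument doc = WordprocessingDocument.Open(memory, true))
            {
                var settings = new HtmlConverterSettings
                {
                    PageTitle = Path.GetFileNameWithoutExtension(docxPath),
                };
                XElement html = HtmlConverter.ConvertToHtml(doc, settings);

                // DisableFormatting avoids extra whitespace being added between inline elements.
                File.WriteAllText(htmlPath, html.ToString(SaveOptions.DisableFormatting));
            }
        }
    }

    // Converts every docx in inputFolder, writing the results to outputFolder.
    public static void ConvertFolder(string inputFolder, string outputFolder)
    {
        Directory.CreateDirectory(outputFolder);
        foreach (string docxPath in Directory.EnumerateFiles(inputFolder, "*.docx"))
        {
            string name = Path.GetFileNameWithoutExtension(docxPath) + ".html";
            ConvertFile(docxPath, Path.Combine(outputFolder, name));
        }
    }
}
```

No instance of Word is started at any point, which addresses the concern about opening and closing Word 50+ times.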

Since I'm a big fan of Aspose.Words, a commercial library to create/process Word documents, I would do something like:
Open the Word document with Aspose.Words.
Save the Word document as HTML.
Use something like SgmlReader or HTML Agility Pack (or even Regular Expressions if it is suitable) to remove unwanted HTML tags/attributes.
Since you wrote that you work at a university, I'm not sure whether commercial packages are an option, though.
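
The cleanup step could look roughly like this. The sketch uses LINQ to XML rather than SgmlReader or HTML Agility Pack, so it assumes the converter output is well-formed XHTML. Which attributes and elements count as "unwanted" is an assumption too: here style and class attributes are dropped and span elements are unwrapped, so tables and structural tags survive.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class HtmlCleaner
{
    // Returns a cleaned copy of the given XHTML tree. The input is not modified.
    public static XElement Clean(XElement html)
    {
        var cleaned = new XElement(html);

        // Drop presentational attributes everywhere.
        var unwanted = cleaned.DescendantsAndSelf().Attributes()
            .Where(a => a.Name.LocalName == "style" || a.Name.LocalName == "class")
            .ToList();
        foreach (XAttribute attribute in unwanted)
        {
            attribute.Remove();
        }

        // Unwrap <span> elements but keep their content. Reverse document order
        // handles inner spans before the spans that contain them.
        var spans = cleaned.Descendants()
            .Where(e => e.Name.LocalName == "span")
            .Reverse()
            .ToList();
        foreach (XElement span in spans)
        {
            span.ReplaceWith(span.Nodes());
        }

        return cleaned;
    }
}
```

The same pattern extends to other tags or attributes by adding names to the two filters.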

Hi, I'm not sure what the rules are on promoting your own solutions, so do let me know if I am out of line.
I am a web developer who had the same issues, so I created my own tool:
http://www.convertwordtohtml.com
We are also working on a new version that will have even better conversion quality and one-click conversion, e.g. you can right-click on a Word file and it will be directly converted to HTML and the code placed into the clipboard. The current version also supports command-line access, and the new version will have a server version too.
There is a free trial version downloadable from the site , and if you have any questions do contact me any time.

Generating a Word Document with C#

Given a list of mailing addresses, I need to open an existing Word document, which is formatted for printing labels, and then insert each address into a different cell of the table. The current solution opens the Word application and moves the cursor to insert the text. However, after reading about the security issues and problems associated with opening the newer versions of Word from a web application, I have decided that I need to use another method.
I have looked into using Office Open XML, but I have not found any good resources that provide concrete information on exactly how to use it. Also, someone suggested that I use SQL Reporting Services, but searching for information on how to use them led me nowhere.
Which method do you think is the most appropriate for my problem?
Code samples and links to good tutorials would be extremely helpful.

Thanks for all the answers, but I really did not want to pay for a plugin, and using Word automation was out of the question. So I kept searching and eventually, through some trial and error, found some answers.
After thoroughly searching through Microsoft's site, I found some newer articles on the Office Open XML SDK. I downloaded the new tools and just started going through each of them.
I then found the Document Reflector, which creates a class to generate XML code based on an existing Word document (.docx). Using my label template document and the code this tool generated, I went through and added a loop that appends table cells for each address. It actually proved to be fairly simple and way faster than using Word automation.
So, if you're still using Word automation, check out the Office Open XML tools. They're surprisingly extensive for a free download from Microsoft.
Office Open XML SDK 2.0 Download
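
This is not the code the Document Reflector generated, but a simplified sketch of the same idea with the Open XML SDK: open a copy of the label template and write one address into each cell of its first table. The class and method names are made up, and it assumes every cell in the table is a label (real label templates often contain narrow spacer cells that would have to be skipped).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class LabelFiller
{
    // Writes addresses into the cells of the first table, one address per cell.
    // Lines within an address are separated by '\n'. Returns the number of labels written.
    public static int FillLabels(string docxPath, IReadOnlyList<string> addresses)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, true))
        {
            Table table = doc.MainDocumentPart.Document.Body.Elements<Table>().First();
            List<TableCell> cells = table.Descendants<TableCell>().ToList();
            int count = Math.Min(cells.Count, addresses.Count);

            for (int i = 0; i < count; i++)
            {
                // Reuse the cell's first paragraph so its formatting is kept.
                Paragraph paragraph = cells[i].Elements<Paragraph>().FirstOrDefault()
                    ?? cells[i].AppendChild(new Paragraph());
                Run run = paragraph.AppendChild(new Run());

                string[] lines = addresses[i].Split('\n');
                for (int j = 0; j < lines.Length; j++)
                {
                    if (j > 0)
                    {
                        run.AppendChild(new Break()); // line break inside the label
                    }
                    run.AppendChild(new Text(lines[j].TrimEnd('\r')));
                }
            }

            doc.MainDocumentPart.Document.Save();
            return count;
        }
    }
}
```

If there are more addresses than cells, the return value tells the caller how many were placed, so the remaining ones can go into another copy of the template.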

I use the Words plugin from Aspose.com to do mail merges (programming guide).

You can take a look at shows 137 and 138 on dnrTV (www.dnrtv.com). In these videos Beth Massi shows how to do some editing and mail merging with OpenXML. She does this by using the Open XML SDK and XML literals in VB. It requires no third-party components. It also doesn't require MS Office to be installed on the machine.
These videos inspired me, as a C# developer (with no VB experience), to do some XML manipulation in a separate DLL written in VB. I call into this DLL from my C# application.
It is worth a try.

We have the Aspose product that tvanfosson mentioned. The edition that we purchased works with SQL Reporting Services, so it can be used with the scheduler for creating output. It is really a great product, and we used it in a system that needed to support Korean characters in the final document. It works great and was under $1K with support. Not bad.
The advantage of using a product like this is that you can continue to manage your data and the skill set required to produce the documents is at a level where a variety of developers can support its use.

Vanstee,
If you really want to do this in code, check out this post I just found on Google
http://kellychronicles.spaces.live.com/blog/cns!A0D71E1614E8DBF8!1364.entry

If you are using Reporting Services, can't you just move the information in the Word doc into a database table and read it from there, taking Word out of the equation?
