How to split a document in OpenXML SDK - c#

I need to split a document in OpenXml sdk 2.0. The document has sections that each have a footer with a text element (name of the section). Is there a straightforward way to copy from one OpenXml document to another?

DocumentBuilder is the tool you are looking for. See for example, http://blogs.msdn.com/b/ericwhite/archive/2010/01/08/how-to-control-sections-when-using-openxml-powertools-documentbuilder.aspx

This would require a lot of work on your part to copy and merge stylesheets among other things. I'd recommend using altChunk to do the merging for you as it will take care of all the hard stuff for you. Here are two links to help explain it more: How to Use altChunk for Document Assembly and How to: Assemble Multiple Word Processing Documents in One

I have done similar to what you describe using just the OpenXmlSDK. Though I have to say it wasn't much fun, and I was left wanting a solution I didn't have to carve myself. In my case I had to keep footers/headers etc. with section content and split the document into several other documents.
At the time I couldn't locate any samples on identifying which section an element belonged to, and had to write a utility myself. (The way word splits sections is by injecting a section break after the content, and the SDK didn't seem to provide any helpers.) I then had to locate the header definition by using the headerReference and grab that content too, before creating a new document and injecting the header, footer, and section content.
I wish you the best of luck!

Related

Headers lose styles after merging documents with Xceed Docx

Let me explain my scenario.
I am making use of Xceed Docx library to merge and manipulate word documents.
I have multiple templates that needs to be merged to form one customer facing document.
All of them having individual document headers, tables and images.
As per business requirements, we need to make use of content controls as there will be manual intervention.
PROBLEM:
All goes well and the merge works as expected, but it seems to drop the styling of the headers in merged document. But this only occurs when I include CONTENT CONTROLS (rich text content control)!
For example: Header 1, Header 2 becomes normal text....
Has anyone experienced anything similar with this library?
Is there something I am doing wrong or missing?
I did try and contact the developers of DocX, with no avail.
I tried merging the files with OpenXml using AltChunk.
This did work but not to the extend that I required.
Let me explain.
AltChunk inserts the entire file (doc2.docx) into the base file(doc1.docx)
and then only add reference of doc2 inside doc1's XML file.
Hope that makes sense.
MS Word can open this file, but when I want to make changes using DocX it is unable to load the file.
I ended up using Docx for all the document manipulation and OpenXmlPowerTools to merge the documents.
OpenXmlPowerTools seems to resolve the above mentioned issue as its does seem to do a complete image, chart and text merge.
I hope this helps someone in the near future ;-P

Navigate By Columns using Aspose.words in C#

Am evaluating the Aspose.words for one of my client, almost all of the feature i have migrated from MS Word library to Aspose.word library. Just one more to go, but am struggling to find the solution for the below:
We have Template document which is in .docx format. Template has a Two column page layout. at run time system would copy paste the content from other document to this Template document. still this steps works fine.
When i open the template page it looks good with 2 column layout.
But we have some logic that should read the last line of First Column & checks whether the text is in specific format, if it is then moves one line down which would automaticaly moves to the next column.
This logic is easily acheivable in Word but i couldn't find any refference in Aspose.words to implement this.
Also i tried to find different option by convering the document to Xml. & found that there is one node called . but this node is visble only when i save the document as xml From Microsoft word. Not occurs if i save the document as xml from Aspose.words.
Please advice me to solve this issue.
Thanks in advance
Gunasekara S
we just have finished integrating a feature into Aspose.Words to open up access to the rendering engine so that each element of the rendered document can be read as it appears as pages, columns, lines, spans etc. This functionality is exactly what you need and will be available in the next version of Aspose.Words which is expected to release in about a week's time. Soon I will be sharing the code snippet to accomplish your requirement.
My name is Nayyer and I am developer evangelist at Aspose.

How do I explore a PDF to determine if an element is text?

I have a PDF and want to extract the text contained in it. I've tried a few different PDF libraries and they all return basically the same results. When extracting the text from a two page document with literally hundreds of words, only a dozen or so words from the header are returned.
Is there any way to tell if the text I'm after is actually text or a raster image of the text? I'm thinking something along the lines of Firebug's "Inspect Element" but at this point I'll take any solution that tells what I'm really looking at.
This project really doesn't justify attempting to use OCR. And, although a simple solution, using fields in the PDF is not an option since the generator of the file is a third party.
If Acrobat/Reader can select the text, then it Is Text.
Reasons your library might not be able to find the text in question:
Complex/bad fonts or encodings. Adobe can be very forgiving of garbage in, somehow managing to get Good Info out.
The text could be in an annotation rather than the page contents. It won't matter what program parses the content stream if you need to look in the annot array instead.
You didn't name a particular library, so it's possible that the library you're using doesn't look inside XObject Forms. That's unlikely in an even remotely mature API, but stranger things have happened.
If you can get away with copy/pasta from Reader, then just go that route.
Have you tried Amyuni PDF Creator .Net? It allows you to enumerate all components from a specified rectangular region of a page and inspect their type from a predefined types list. You could run a quick test using the trial version and the following code sample for text extraction:
// open a PDF file
axPDFCreactiveX1.Open(System.IO.Directory.GetCurrentDirectory()+"\\sampleBookmarks.pdf", "");
axPDFCreactiveX1.Refresh ();
String text = axPDFCreactiveX1.GetRawPageText (1);
MessageBox.Show (text);
Additionally, it provides Tesseract OCR integration in case you needed it.
Disclaimer: I am part of the development team of this product.
Check this site out. It may contain some helpful code snippets. http://www.codeproject.com/KB/cs/PDFToText.aspx

Merge documents

I'm trying to merge two docx-documents into one docx-document using OpenXML SDK 2.0. The documents should be merged without loosing their styling and custom headers and footers. I hope I can achieve this using AltChunk and a section break. But I can't get it working.
Is it possible what I'm trying to do? Can someone give me a hint how to achieve this?
The above answer is NOT correct at all! This is EXACTLY what AltChunk has been designed to do, and it works great!
NOTE: that the documents will not be merged into one document UNTIL Word opens the file for the first time (obviously the file has to be saved or the file on disk won't be updated.)
See this blog for more information on how to do it properly:
https://blogs.msdn.com/b/ericwhite/archive/2008/10/27/how-to-use-altchunk-for-document-assembly.aspx?Redirected=true
p.s. As for examining Open XML using the productivity tool, my opinion is to just install the official Visual Studio Open XML add-on and open the Office Documents from Visual Studio to examine them, it's super convenient! :-)
Using the 'Open XML Productivity Tool' I analyzed the structure of a docx-document, and concluded that merging documents with their style, headers, footers, ... is not possible out of the box using Altchunk. You can download the tool seperatly from the open xml sdk.
What I'm doing now, and what is working, is copying everything manually into to document, making sure that all style-references, header-references, footer-references, ... are preserved. This means that I give them a new unique id before I copy them into the document and changing all references from the old id to the new id. There is a lot of code to do this, but the tool mentioned above really helped.
Adding a section break is also quite difficult. You should know that the SectionProperties-tag describes all the properties of the section and that there can be one SectionProperties-tag under the Body-tag, describing the properties of the last section. So adding a new sectionbreak, means copying the last SectionProperties-tag to the last paragraph of the section and adding a new SectionProperties-tag under the Body-tag. I also got al lot of information from the productivity tool.

What is the best way to populate a Word 2007 template in C#?

I have a need to populate a Word 2007 document from code, including repeating table sections - currently I use an XML transform on the document.xml portion of the docx, but this is extremely time consuming to setup (each time you edit the template document, you have to recreate the transform.xsl file, which can take up to a day to do for complex documents).
Is there any better way, preferably one that doesn't require you to run Word 2007 during the process?
Regards
Richard
I tried myself to write some code for that purpose, but gave up. Now I use a 3rd party product: Aspose Words and am quite happy with that component.
It doesn't need Microsoft Word on the machine.
"Aspose.Words enables .NET and Java applications to read, modify and write Word® documents without utilizing Microsoft Word®."
"Aspose.Words supports a wide array of features including document creation, content and formatting manipulation, powerful mail merge abilities, comprehensive support of DOC, OOXML, RTF, WordprocessingML, HTML, OpenDocument and PDF formats. Aspose.Words is truly the most affordable, fastest and feature rich Word component on the market."
DISCLAIMER: I am not affiliated with that company.
Since a DOCX file is simply a ZIP file containing a folder structure with images and XML files, you should be able to manipulate those XML files using our favorite XML manipulation API. The specification of the format is known as WordprocessingML, part of the Office Open XML standard.
I thought I'd mention it in case the 3rd party tool suggested by splattne is not an option.
Have you considered using the Open XML SDK from Microsoft? The only dependency is on .NET 3.5.
Documentation: http://msdn.microsoft.com/en-us/library/bb448854%28office.14%29.aspx
Download: http://www.microsoft.com/downloads/details.aspx?familyid=C6E744E5-36E9-45F5-8D8C-331DF206E0D0&displaylang=en
Use invoke docx lib. it supports table data (http://invoke.co.nz/products/help/docx_tables.aspx). More info at http://invoke.co.nz/products/docx.aspx
Have you considered using VB? You could create a separate assembly to populate your document.
I know you are looking for a C# solution, but the XML literal support is one area where XML literal support could help you populate the document. Create a document in Word to server as a template, unzip the docx, paste the relevant XML section you want to change into you VB code, and add code to fill in the parts you wish to change. It's difficult to say from your description if this would meet your requirements but I would suggest looking into it.

Categories

Resources