How do I extract attachments from a PDF file? - C#

I have a big number of PDF documents with XML files attached to them. I would like to extract those attached XML files and read them. How can I do this programmatically using .NET?

iTextSharp is also quite capable of extracting attachments, though you might have to use the low-level objects to do so.
There are two ways to embed files in a PDF:
In a file annotation.
At the document level, in the "EmbeddedFiles" name tree.
Once you have a file specification dictionary from either source, the file itself will be a stream within that dictionary labeled "EF" (embedded file).
So, to list all the files at the document level, one would write code like this (in Java):
Map<String, byte[]> files = new HashMap<String, byte[]>();

PdfReader reader = new PdfReader(pdfPath);
PdfDictionary root = reader.getCatalog();
PdfDictionary names = root.getAsDict(PdfName.NAMES); // may be null
PdfDictionary embeddedFilesDict = names.getAsDict(PdfName.EMBEDDEDFILES); // may be null
PdfArray embeddedFiles = embeddedFilesDict.getAsArray(PdfName.NAMES); // may be null

int len = embeddedFiles.size();
for (int i = 0; i < len; i += 2) {
    PdfString name = embeddedFiles.getAsString(i); // should always be present
    PdfDictionary fileSpec = embeddedFiles.getAsDict(i + 1); // ditto
    PdfDictionary streams = fileSpec.getAsDict(PdfName.EF);

    PRStream stream = null;
    if (streams.contains(PdfName.UF))
        stream = (PRStream) streams.getAsStream(PdfName.UF);
    else
        stream = (PRStream) streams.getAsStream(PdfName.F); // default stream for backwards compatibility

    if (stream != null) {
        files.put(name.toUnicodeString(), PdfReader.getStreamBytes(stream));
    }
}
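Since the question asks for .NET, a near-mechanical C# translation of the same document-level walk with iTextSharp 5.x might look like the sketch below. The method name and the reader.Close() call are my additions; the dictionary calls mirror the Java ones, Pascal-cased as in the .NET port.

using System.Collections.Generic;
using iTextSharp.text.pdf;

static Dictionary<string, byte[]> ExtractDocumentLevelAttachments(string pdfPath)
{
    var files = new Dictionary<string, byte[]>();
    PdfReader reader = new PdfReader(pdfPath);
    PdfDictionary names = reader.Catalog.GetAsDict(PdfName.NAMES); // may be null
    if (names != null)
    {
        PdfDictionary embeddedFilesDict = names.GetAsDict(PdfName.EMBEDDEDFILES); // may be null
        PdfArray embeddedFiles = embeddedFilesDict == null
            ? null
            : embeddedFilesDict.GetAsArray(PdfName.NAMES); // may be null
        if (embeddedFiles != null)
        {
            // Entries alternate: name string, then file specification dictionary.
            for (int i = 0; i < embeddedFiles.Size; i += 2)
            {
                PdfString name = embeddedFiles.GetAsString(i);
                PdfDictionary fileSpec = embeddedFiles.GetAsDict(i + 1);
                PdfDictionary streams = fileSpec.GetAsDict(PdfName.EF);
                PdfStream stream = streams.GetAsStream(PdfName.UF);
                if (stream == null)
                    stream = streams.GetAsStream(PdfName.F); // fallback for older files
                if (stream != null)
                    files[name.ToUnicodeString()] = PdfReader.GetStreamBytes((PRStream)stream);
            }
        }
    }
    reader.Close();
    return files;
}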

This is an old question; nonetheless, I think my alternative solution (using PDF Clown) may be of some interest, as it's much cleaner (and more complete, since it iterates at both the document and the page level) than the code fragments proposed so far:
using org.pdfclown.bytes;
using org.pdfclown.documents;
using org.pdfclown.documents.files;
using org.pdfclown.documents.interaction.annotations;
using org.pdfclown.objects;
using System;
using System.Collections.Generic;

void ExtractAttachments(string pdfPath)
{
    Dictionary<string, byte[]> attachments = new Dictionary<string, byte[]>();
    using (org.pdfclown.files.File file = new org.pdfclown.files.File(pdfPath))
    {
        Document document = file.Document;

        // 1. Embedded files (document level).
        foreach (KeyValuePair<PdfString, FileSpecification> entry in document.Names.EmbeddedFiles)
        { EvaluateDataFile(attachments, entry.Value); }

        // 2. File attachments (page level).
        foreach (Page page in document.Pages)
        {
            foreach (Annotation annotation in page.Annotations)
            {
                if (annotation is FileAttachment)
                { EvaluateDataFile(attachments, ((FileAttachment)annotation).DataFile); }
            }
        }
    }
}

void EvaluateDataFile(Dictionary<string, byte[]> attachments, FileSpecification dataFile)
{
    if (dataFile is FullFileSpecification)
    {
        EmbeddedFile embeddedFile = ((FullFileSpecification)dataFile).EmbeddedFile;
        if (embeddedFile != null)
        { attachments[dataFile.Path] = embeddedFile.Data.ToByteArray(); }
    }
}
Note that you don't have to bother with null checks, as PDF Clown provides all the necessary abstraction and automation to ensure smooth model traversal.
PDF Clown is an LGPL 3 library, implemented on both the Java and .NET platforms (I'm its lead developer): if you want to give it a try, I suggest you check out its SVN repository on sourceforge.net, as it keeps evolving.

Look at the ABCpdf library; it's very easy and fast, in my opinion.

What I got working is slightly different from anything else I have seen online.
So, just in case, I thought I would post this here to help someone else. I had to go through many different iterations to figure out, the hard way, what I needed to get it to work.
I am merging two PDFs into a third PDF, where one of the first two PDFs may have file attachments that need to be carried over into the third. I am working completely in streams with ASP.NET, C# 4.0, and iTextSharp 5.1.2.0.
// Extract files from the submit PDF.
Dictionary<string, byte[]> files = new Dictionary<string, byte[]>();
PdfDictionary names;
PdfDictionary embeddedFiles;
PdfArray fileSpecs;
int eFLength = 0;

// writeReader is the PdfReader for a PDF input stream.
names = writeReader.Catalog.GetAsDict(PdfName.NAMES); // may be null
if (names != null)
{
    embeddedFiles = names.GetAsDict(PdfName.EMBEDDEDFILES); // may be null
    if (embeddedFiles != null)
    {
        fileSpecs = embeddedFiles.GetAsArray(PdfName.NAMES); // may be null
        if (fileSpecs != null)
        {
            eFLength = fileSpecs.Size;
            // Entries come in name/file-spec pairs, so only the odd indexes (1, 3, 5...) hold file specs.
            for (int i = 1; i < eFLength; i += 2)
            {
                PdfDictionary fileSpec = fileSpecs.GetAsDict(i); // may be null
                if (fileSpec != null)
                {
                    PdfDictionary refs = fileSpec.GetAsDict(PdfName.EF);
                    foreach (PdfName key in refs.Keys)
                    {
                        PRStream stream = (PRStream)PdfReader.GetPdfObject(refs.GetAsIndirectObject(key));
                        if (stream != null)
                        {
                            files.Add(fileSpec.GetAsString(key).ToString(), PdfReader.GetStreamBytes(stream));
                        }
                    }
                }
            }
        }
    }
}

You may try Aspose.Pdf.Kit for .NET. The PdfExtractor class allows you to extract attachments with the help of two methods: ExtractAttachment and GetAttachment. Please see an example of attachment extraction.
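As a rough, hedged sketch of how those two methods are typically combined (the method names come from the answer above; the BindPdf call, the GetAttachment overload, and the paths are assumptions based on the library's documented usage pattern and may differ by version):

// Hedged sketch only: treat the calls around ExtractAttachment/GetAttachment
// as assumptions; check the Aspose documentation for your version.
using Aspose.Pdf.Kit;

class AttachmentDemo
{
    static void Main()
    {
        PdfExtractor extractor = new PdfExtractor();
        extractor.BindPdf("input.pdf");       // open the source PDF
        extractor.ExtractAttachment();        // read the attachments
        extractor.GetAttachment("outputDir"); // write them out to a folder
    }
}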
Disclosure: I work as developer evangelist at Aspose.

Related

Why does my PDF file size increase after splitting and merging back? (Using PDFSharp c#)

I am basically splitting a PDF document into multiple documents containing one page each. After splitting, I perform some operations and then merge the documents back into a single PDF. I am using PDFsharp in C# to do this. The problem I am facing is that when I split the document and then add the pages back, the file size increases from 1.96 MB to 12.2 MB. After thorough testing, I have determined that the problem lies not in the operations I am performing after splitting, but in the actual splitting and merging of the PDF documents. The following are the functions I have created.
public static List<Stream> SplitPdf(Stream PdfDoc)
{
    System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
    List<Stream> outputStreamList = new List<Stream>();
    PdfSharp.Pdf.PdfDocument inputDocument = PdfReader.Open(PdfDoc, PdfDocumentOpenMode.Import);
    for (int idx = 0; idx < inputDocument.PageCount; idx++)
    {
        PdfSharp.Pdf.PdfDocument outputDocument = new PdfSharp.Pdf.PdfDocument();
        outputDocument.Version = inputDocument.Version;
        outputDocument.Info.Title =
            String.Format("Page {0} of {1}", idx + 1, inputDocument.Info.Title);
        outputDocument.Info.Creator = inputDocument.Info.Creator;
        outputDocument.AddPage(inputDocument.Pages[idx]);
        MemoryStream stream = new MemoryStream();
        outputDocument.Save(stream);
        outputStreamList.Add(stream);
    }
    return outputStreamList;
}

public static Stream MergePdfs(List<Stream> PdfFiles)
{
    System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
    PdfSharp.Pdf.PdfDocument outputPDFDocument = new PdfSharp.Pdf.PdfDocument();
    foreach (Stream pdfFile in PdfFiles)
    {
        PdfSharp.Pdf.PdfDocument inputPDFDocument = PdfReader.Open(pdfFile, PdfDocumentOpenMode.Import);
        outputPDFDocument.Version = inputPDFDocument.Version;
        foreach (PdfSharp.Pdf.PdfPage page in inputPDFDocument.Pages)
        {
            outputPDFDocument.AddPage(page);
        }
    }
    Stream compiledPdfStream = new MemoryStream();
    outputPDFDocument.Save(compiledPdfStream);
    return compiledPdfStream;
}
The questions I have are:
Why am I getting this behaviour?
Is there a solution where I can split and merge and end up with a file of the same size? (Any open-source C# library is fine.)
Replying to question 1:
When splitting the files, every file will contain all resources required by the pages it contains.
When merging with PDFsharp again, resources will not be merged and the final document may contain duplicated resources (fonts, images), thus leading to larger files.
This is by design.
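For question 2, a possible workaround (a sketch, assuming your per-page operations can be applied in place while merging; the method name is mine) is to add pages to the output directly from the original document, so PDFsharp imports each shared resource once rather than once per single-page file:

// Sketch: pages go from the original document straight into one output
// document, so resources shared by several pages are imported only once.
using System.IO;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

public static Stream ProcessAndMerge(Stream pdfDoc)
{
    PdfDocument inputDocument = PdfReader.Open(pdfDoc, PdfDocumentOpenMode.Import);
    PdfDocument outputDocument = new PdfDocument();
    outputDocument.Version = inputDocument.Version;
    for (int idx = 0; idx < inputDocument.PageCount; idx++)
    {
        PdfPage page = outputDocument.AddPage(inputDocument.Pages[idx]);
        // ... apply the per-page operations to 'page' here ...
    }
    Stream result = new MemoryStream();
    outputDocument.Save(result);
    return result;
}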

Text extraction using itext7: garbage characters for some pdf documents

I have a problem extracting text from PDF documents using iText7. For documents coming from a specific source, textRenderInfo.GetText() returns only garbage characters (0xfdff) in the event handler of my extraction strategy:
internal class CustomExtractionStrategy : ITextExtractionStrategy
{
    public virtual void EventOccurred(IEventData data, EventType type)
    {
        if (!type.Equals(EventType.RENDER_TEXT))
        {
            return;
        }
        var textRenderInfo = (TextRenderInfo)data;
        bool currentResultEmpty = _result.Length == 0;
        bool isInNewLine = false;
        var baseline = textRenderInfo.GetBaseline();
        var startPoint = baseline.GetStartPoint();
        var endPoint = baseline.GetEndPoint();
        var currentText = textRenderInfo.GetText(); // returns garbage for specific PDFs
        // further processing below
        ...
    }
}
I'm not very familiar with the way text/glyph encoding works in PDF, but I'll try to give some details from comparing the problematic PDFs with an example where extraction works. For the PDFs with issues:
textRenderInfo.gs.font is MS-UIGothic
textRenderInfo.gs.font.fontProgram.codeToGlyph contains only one mapping (key 0, to a glyph with width 1000, unicode -1, code 0)
textRenderInfo.gs.font.fontProgram.unicodeToGlyph contains no records
These are the most obvious discrepancies. If there's anything else I should look out for, please let me know. I would have provided an example of the PDF in question, but it might contain sensitive information that I must not disclose.
Note: the PDFs can be read correctly in Acrobat Reader, and I can copy text from the reader into Notepad. Other libraries (pdfium-based, or ports of PDFBox) can properly extract text from the document, so I think the document as such is "valid".
If this is a known issue for iText7, is there any workaround (other than using a different library altogether)?
Update
With the link provided in the comment and the following code (in addition to the custom extraction strategy snippet shown above), I still get garbage characters:
internal class PdfExtractor
{
    internal void ExtractFromPath(string path)
    {
        PdfReader reader = new PdfReader(path);
        var document = new iText.Kernel.Pdf.PdfDocument(reader);
        for (int pageNum = 1; pageNum <= document.GetNumberOfPages(); pageNum++)
        {
            var page = document.GetPage(pageNum);
            string text = PdfTextExtractor.GetTextFromPage(page, new CustomExtractionStrategy());
        }
    }
}

Unable to merge 2 PDFs using MemoryStream

I have a C# class that takes HTML and converts it to PDF using wkhtmltopdf.
As you will see below, I am generating three PDFs: landscape, portrait, and the two combined.
The properties object contains the HTML as a string and the argument for landscape/portrait orientation.
System.IO.MemoryStream PDF = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file = new System.IO.FileStream("abc_landscape.pdf", System.IO.FileMode.Create);
PDF.Position = 0;

properties.IsHorizontalOrientation = false;
System.IO.MemoryStream PDF_portrait = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file_portrait = new System.IO.FileStream("abc_portrait.pdf", System.IO.FileMode.Create);
PDF_portrait.Position = 0;

System.IO.MemoryStream finalStream = new System.IO.MemoryStream();
PDF.CopyTo(finalStream);
PDF_portrait.CopyTo(finalStream);
System.IO.FileStream file_combined = new System.IO.FileStream("abc_combined.pdf", System.IO.FileMode.Create);

try
{
    PDF.WriteTo(file);
    PDF.Flush();
    PDF_portrait.WriteTo(file_portrait);
    PDF_portrait.Flush();
    finalStream.WriteTo(file_combined);
    finalStream.Flush();
}
catch (Exception)
{
    throw;
}
finally
{
    PDF.Close();
    file.Close();
    PDF_portrait.Close();
    file_portrait.Close();
    finalStream.Close();
    file_combined.Close();
}
The PDFs "abc_landscape.pdf" and "abc_portrait.pdf" generate correctly, as expected, but the operation fails when I try to combine the two in a third pdf (abc_combined.pdf).
I am using MemoryStream to preform the merge, and at the time of debug, I can see that the finalStream.length is equal to the sum of the previous two PDFs. But when I try to open the PDF, I see the content of just 1 of the two PDFs.
The same can be seen below:
Additionally, when I try to close the "abc_combined.pdf", I am prompted to save it, which does not happen with the other 2 PDFs.
Below are a few things that I have tried already, to no avail:
Changing CopyTo() to WriteTo()
Merging the same PDF (either the landscape or the portrait one) with itself
In case it is required, below is an elaboration of the GetPdfStream() method.
var htmlStream = new MemoryStream();
var writer = new StreamWriter(htmlStream);
writer.Write(htmlString);
writer.Flush();
htmlStream.Position = 0;
return htmlStream;

// The HTML stream is then piped through the wkhtmltopdf process:
Process process = Process.Start(psi);
process.EnableRaisingEvents = true;
try
{
    process.Start();
    process.BeginErrorReadLine();
    var inputTask = Task.Run(() =>
    {
        htmlStream.CopyTo(process.StandardInput.BaseStream);
        process.StandardInput.Close();
    });

    // Copy the output to a MemoryStream.
    MemoryStream pdf = new MemoryStream();
    var outputTask = Task.Run(() =>
    {
        process.StandardOutput.BaseStream.CopyTo(pdf);
    });

    Task.WaitAll(inputTask, outputTask);
    process.WaitForExit();

    // Reset the MemoryStream read position.
    pdf.Position = 0;
    return pdf;
}
catch (Exception)
{
    throw; // rethrow without resetting the stack trace
}
finally
{
    process.Dispose();
}
Merging PDFs in C# (or any other language) is not straightforward without using a third-party library.
I assume your reason for not using a library is that most free libraries and NuGet packages have limitations and/or cost money for commercial use.
I did some research and found an open-source library called PDF Clown, with a NuGet package; it is also available for Java. It is free, without limitations (donate if you like). The library has a lot of features; one is that you can merge two or more documents into one.
I'm supplying an example that takes a folder with multiple PDF files, merges them, and saves the result to the same or another folder. It is also possible to use a MemoryStream, but I do not find it necessary in this case.
The code is self-explanatory; the key point here is using SerializationModeEnum.Incremental:
public static void MergePdf(string srcPath, string destFile)
{
    var list = Directory.GetFiles(Path.GetFullPath(srcPath));
    if (string.IsNullOrWhiteSpace(srcPath) || string.IsNullOrWhiteSpace(destFile) || list.Length <= 1)
        return;

    var files = list.Select(File.ReadAllBytes).ToList();
    using (var dest = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(files[0])))
    {
        var document = dest.Document;
        var builder = new org.pdfclown.tools.PageManager(document);
        foreach (var file in files.Skip(1))
        {
            using (var src = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(file)))
            { builder.Add(src.Document); }
        }
        dest.Save(destFile, SerializationModeEnum.Incremental);
    }
}
To test it:
var srcPath = @"C:\temp\pdf\input";
var destFile = @"c:\temp\pdf\output\merged.pdf";
MergePdf(srcPath, destFile);
Links to my research:
https://csharp-source.net/open-source/pdf-libraries
https://sourceforge.net/projects/clown/
https://www.oipapio.com/question-3526089
Disclaimer: Part of this answer is taken from my personal website, https://itbackyard.com/merge-multiple-pdf-files-to-one-pdf-file-in-c/, which links to the source code on GitHub.
This answer from Stack Overflow (Combine two (or more) PDF's) by Andrew Burns works for me:
using (PdfDocument one = PdfReader.Open("pdf 1.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument two = PdfReader.Open("pdf 2.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument outPdf = new PdfDocument())
{
    CopyPages(one, outPdf);
    CopyPages(two, outPdf);
    outPdf.Save("file1and2.pdf");
}

void CopyPages(PdfDocument from, PdfDocument to)
{
    for (int i = 0; i < from.PageCount; i++)
    {
        to.AddPage(from.Pages[i]);
    }
}
That's not quite how PDFs work. PDFs are structured files in a specific format.
You can't just append the bytes of one to the other and expect the result to be a valid document.
You're going to have to use a library that understands the format and can do the operation for you, or develop your own solution.
PDF files aren't just text and images. Behind the scenes there is a strict file format that describes things like the PDF version, the objects contained in the file, and where to find them.
In order to merge two PDFs you'll need to manipulate the streams.
First you'll need to conserve the header from only one of the files. This is pretty easy, since it's just the first line.
Then you can write the body of the first document, and then the second.
Now the hard part, and likely the part that will convince you to use a library: you have to rebuild the xref table. The xref table is a cross-reference table that describes the content of the document and, more importantly, where to find each element. You'd have to calculate the byte offset of the second document, shift all of the entries in its xref table by that much, and then append its xref table to the first. You'll also need to ensure you create objects in the xref table for the page break.
Once that's done, you need to rebuild the document trailer, which tells an application where the various sections of the document are, among other things.
See https://resources.infosecinstitute.com/pdf-file-format-basic-structure/
This is not trivial, and you'll end up rewriting lots of code that already exists.
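To make the offset problem concrete, here is a small illustrative sketch (not a merge implementation; the class name is mine) that reads the startxref pointer at the end of a PDF. After a naive byte-level concatenation, this offset would still point into the first document, which is one reason the result is broken:

// Illustrative sketch only: print the byte offset that "startxref" points to.
// After naive concatenation this offset no longer matches the combined file.
using System;
using System.IO;
using System.Text;

class StartXrefProbe
{
    static void Main(string[] args)
    {
        byte[] bytes = File.ReadAllBytes(args[0]);
        // The pointer lives in the last few hundred bytes of the file.
        int tailLength = Math.Min(1024, bytes.Length);
        string tail = Encoding.ASCII.GetString(bytes, bytes.Length - tailLength, tailLength);
        int idx = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (idx >= 0)
        {
            // The offset is on the line following the keyword.
            string offset = tail.Substring(idx + "startxref".Length).Trim().Split('\n', '\r')[0];
            Console.WriteLine("xref table starts at byte offset " + offset);
        }
        else
        {
            Console.WriteLine("No startxref keyword found.");
        }
    }
}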

Merging Word documents using WordProcessingDocument

I am currently working on a program in which a user should be able to merge several Word documents into one, without losing any formatting, headers and so on. The documents should simply stack up, one after another, without any changes.
Here is my current code:
public virtual Byte[] MergeWordFiles(IEnumerable<SendData> sourceFiles)
{
    int f = 0;
    // If there is only one Word document, skip the merge.
    if (sourceFiles.Count() == 1)
    {
        return sourceFiles.First().File;
    }
    else
    {
        MemoryStream destinationFile = new MemoryStream();

        // Add the first file.
        var firstFile = sourceFiles.First().File;
        destinationFile.Write(firstFile, 0, firstFile.Length);
        destinationFile.Position = 0;

        int pointer = 1;
        byte[] ret;

        // Add the rest of the files.
        try
        {
            using (WordprocessingDocument mainDocument = WordprocessingDocument.Open(destinationFile, true))
            {
                XElement newBody = XElement.Parse(mainDocument.MainDocumentPart.Document.Body.OuterXml);
                for (pointer = 1; pointer < sourceFiles.Count(); pointer++)
                {
                    WordprocessingDocument tempDocument = WordprocessingDocument.Open(new MemoryStream(sourceFiles.ElementAt(pointer).File), true);
                    XElement tempBody = XElement.Parse(tempDocument.MainDocumentPart.Document.Body.OuterXml);
                    newBody.Add(XElement.Parse(new DocumentFormat.OpenXml.Wordprocessing.Paragraph(new Run(new Break { Type = BreakValues.Page })).OuterXml));
                    newBody.Add(tempBody);
                    mainDocument.MainDocumentPart.Document.Body = new Body(newBody.ToString());
                    mainDocument.MainDocumentPart.Document.Save();
                    mainDocument.Package.Flush();
                }
            }
        }
        catch (OpenXmlPackageException oxmle)
        {
            throw new Exception(string.Format(CultureInfo.CurrentCulture, "Error while merging files. Document index {0}", pointer), oxmle);
        }
        catch (Exception e)
        {
            throw new Exception(string.Format(CultureInfo.CurrentCulture, "Error while merging files. Document index {0}", pointer), e);
        }
        finally
        {
            ret = destinationFile.ToArray();
            destinationFile.Close();
            destinationFile.Dispose();
        }
        return ret;
    }
}
The problem here is that the formatting is copied from the first document and applied to all the rest, meaning that, for instance, a different header in the second document will be ignored. How do I prevent this?
I have been looking into breaking the document into sections using SectionMarkValues.NextPage, as well as using altChunk.
The problem with the latter is that altChunk does not seem to be able to handle a MemoryStream in its FeedData method; a sketch of that approach follows.
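For reference, here is a minimal hedged sketch of the altChunk route, assuming each source document is available as a byte array (destinationStream and sourceBytes below are placeholders). AlternativeFormatImportPart.FeedData accepts any readable Stream, including a MemoryStream, as long as its Position is reset to 0 before the call:

// Hedged sketch: append a whole source document via altChunk so its own
// formatting, styles, and headers are resolved by Word at render time.
// 'destinationStream' and 'sourceBytes' are placeholders for your data.
using System.IO;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static void AppendViaAltChunk(Stream destinationStream, byte[] sourceBytes)
{
    using (WordprocessingDocument mainDocument = WordprocessingDocument.Open(destinationStream, true))
    {
        MainDocumentPart mainPart = mainDocument.MainDocumentPart;
        string altChunkId = "AltChunkId1"; // must be unique within the part
        AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
            AlternativeFormatImportPartType.WordprocessingML, altChunkId);

        using (var ms = new MemoryStream(sourceBytes))
        {
            ms.Position = 0;   // FeedData reads from the current position
            chunk.FeedData(ms);
        }

        AltChunk altChunk = new AltChunk { Id = altChunkId };
        mainPart.Document.Body.AppendChild(altChunk);
        mainPart.Document.Save();
    }
}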
DocIO is a .NET library that can read, write, merge and render Word 2003/2007/2010/2013/2016 files. The whole suite of controls is available for free (for commercial applications as well) through the community license program, if you qualify. The community license is the full product with no limitations or watermarks.
Step 1: Create a console application.
Step 2: Add references to Syncfusion.DocIO.Base, Syncfusion.Compression.Base and Syncfusion.OfficeChart.Base; you can also add these references to your project using NuGet.
Step 3: Copy and paste the following code snippet.
This code snippet will produce the document as per your requirement; each input Word document will be merged with its original formatting, styles and headers/footers.
using Syncfusion.DocIO.DLS;
using Syncfusion.DocIO;
using System.IO;

namespace DocIO_MergeDocument
{
    class Program
    {
        static void Main(string[] args)
        {
            // Indicates whether any input document has "different odd and even headers" enabled.
            bool isDifferentOddAndEvenPagesEnabled = false;

            // Create a new document.
            using (WordDocument mergedDocument = new WordDocument())
            {
                // Get the files from the input directory.
                DirectoryInfo dirInfo = new DirectoryInfo(System.Environment.CurrentDirectory + @"\..\..\Data");
                FileInfo[] fileInfo = dirInfo.GetFiles();
                for (int i = 0; i < fileInfo.Length; i++)
                {
                    if (fileInfo[i].Extension == ".doc" || fileInfo[i].Extension == ".docx")
                    {
                        using (WordDocument sourceDocument = new WordDocument(fileInfo[i].FullName))
                        {
                            // Check whether the document has different odd and even headers/footers.
                            if (!isDifferentOddAndEvenPagesEnabled)
                            {
                                foreach (WSection section in sourceDocument.Sections)
                                {
                                    isDifferentOddAndEvenPagesEnabled = section.PageSetup.DifferentOddAndEvenPages;
                                    if (isDifferentOddAndEvenPagesEnabled)
                                        break;
                                }
                            }

                            // Set the break code of the first section of the source document
                            // to NoBreak so its content is not imported onto a new page.
                            sourceDocument.Sections[0].BreakCode = SectionBreakCode.NoBreak;

                            // Import the contents of the source document at the end of the merged document.
                            mergedDocument.ImportContent(sourceDocument, ImportOptions.KeepSourceFormatting);
                        }
                    }
                }

                // If any of the input documents has different odd and even headers,
                // copy the odd header/footer content into the even header/footer.
                if (isDifferentOddAndEvenPagesEnabled)
                {
                    foreach (WSection section in mergedDocument.Sections)
                    {
                        section.PageSetup.DifferentOddAndEvenPages = true;
                        if (section.HeadersFooters.OddHeader.Count > 0 && section.HeadersFooters.EvenHeader.Count == 0)
                        {
                            for (int i = 0; i < section.HeadersFooters.OddHeader.Count; i++)
                                section.HeadersFooters.EvenHeader.ChildEntities.Add(section.HeadersFooters.OddHeader.ChildEntities[i].Clone());
                        }
                        if (section.HeadersFooters.OddFooter.Count > 0 && section.HeadersFooters.EvenFooter.Count == 0)
                        {
                            for (int i = 0; i < section.HeadersFooters.OddFooter.Count; i++)
                                section.HeadersFooters.EvenFooter.ChildEntities.Add(section.HeadersFooters.OddFooter.ChildEntities[i].Clone());
                        }
                    }
                }

                // If there is no document to merge, add an empty section with an empty paragraph.
                if (mergedDocument.Sections.Count == 0)
                    mergedDocument.EnsureMinimal();

                // Save the document with the given name and format.
                mergedDocument.Save("result.docx", FormatType.Docx);
            }
        }
    }
}
Downloadable Demo
Note: The setting for applying different headers/footers to odd and even pages is at the Word document (not section) level, and each input document can have a different value for it. If any input document has this setting enabled, it will affect the visual appearance of headers/footers in the resulting document. Hence, in that case, the even header/footer contents in the resulting Word document are replaced with the odd header/footer contents.
For further information about DocIO, please refer to our help documentation.
Note: I work for Syncfusion

OpenXml Sdk - Copy Sections of docx into another docx

I am trying the following code. It takes a fileName (a docx file with many sections), and I try to iterate through each section, getting the section name. The problem is that I end up with unreadable docx files. It does not error, but I think I am doing something wrong with getting the elements in the section.
public void Split(string fileName)
{
    using (WordprocessingDocument myDoc =
        WordprocessingDocument.Open(fileName, true))
    {
        string curCliCode = "";
        MainDocumentPart mdp = myDoc.MainDocumentPart;
        foreach (var element in mdp.Document.Body.ChildElements)
        {
            if (element.Descendants().OfType<SectionProperties>().Count() == 1)
            {
                // Get the name of the section from the footer.
                var footer = (FooterPart)mdp.GetPartById(
                    element.Descendants().OfType<SectionProperties>().First()
                        .OfType<FooterReference>().First().Id.Value);
                foreach (Paragraph p in footer.Footer.ChildElements.OfType<Paragraph>())
                {
                    if (p.InnerText != "")
                    {
                        curCliCode = p.InnerText;
                    }
                }
                if (curCliCode != "")
                {
                    var forFile = new List<OpenXmlElement>();
                    var els = element.ElementsBefore();
                    if (els != null)
                    {
                        foreach (var e in els)
                        {
                            if (e != null)
                            {
                                forFile.Add(e);
                            }
                        }
                        for (int i = 0; i < els.Count(); i++)
                        {
                            els.ElementAt(i).Remove();
                        }
                    }
                    Create(curCliCode, forFile);
                }
            }
        }
    }
}

private void Create(string cliCode, IEnumerable<OpenXmlElement> docParts)
{
    var parts = from e in docParts select e.Clone();
    const string template = @"\Test\toSplit\blank.docx";
    string destination = string.Format(@"\Test\{0}.docx", cliCode);
    File.Copy(template, destination, true);

    /* Create the package and main document part */
    using (WordprocessingDocument myDoc =
        WordprocessingDocument.Open(destination, true))
    {
        MainDocumentPart mainPart = myDoc.MainDocumentPart;

        /* Create the contents */
        foreach (var part in parts)
        {
            mainPart.Document.Body.Append((OpenXmlElement)part);
        }

        /* Save the results and close */
        mainPart.Document.Save();
        myDoc.Close();
    }
}
Does anyone know what the problem could be (or how to properly copy a section from one document to another)?
I've done some work in this area, and what I have found invaluable is diffing a known-good file against a prospective file; the error is usually fairly obvious.
What I would do is take a file that you know works and copy all of the sections into the template. Theoretically, the two files should be identical. Run a diff on the document.xml inside each docx file, and you'll see the difference.
BTW, I'm assuming that you know that a docx is actually a zip; change the extension to ".zip", and you'll be able to get at the actual XML files which compose the format.
As far as diff tools go, I use Beyond Compare from Scooter Software.
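For example, a quick way to get at document.xml for diffing (a sketch using System.IO.Compression; the paths are placeholders):

// Sketch: unzip two .docx files so their document.xml parts can be diffed.
using System.IO.Compression;

class DocxUnzip
{
    static void Main()
    {
        // Note: ExtractToDirectory throws if the target directory already has the files.
        ZipFile.ExtractToDirectory(@"C:\temp\known-good.docx", @"C:\temp\known-good");
        ZipFile.ExtractToDirectory(@"C:\temp\prospective.docx", @"C:\temp\prospective");
        // Now diff known-good\word\document.xml against prospective\word\document.xml.
    }
}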
An approach along the lines of what you are doing will work only for simple documents (i.e. those not containing images, hyperlinks, comments, etc.). To handle more complex documents, take a look at http://blogs.msdn.com/b/ericwhite/archive/2009/02/05/move-insert-delete-paragraphs-in-word-processing-documents-using-the-open-xml-sdk.aspx and the resulting DocumentBuilder API (part of the PowerTools for Open XML project on CodePlex).
In order to split a docx into sections using DocumentBuilder, you'll still need to first find the indexes of the paragraphs containing sectPr elements; a short sketch of that step follows.
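As a hedged illustration of that first step (the method name is mine; it assumes the usual Open XML SDK namespaces and that fileName points at the input docx), the body-level indexes of elements that carry a sectPr can be collected like this:

// Sketch: collect the indexes of body-level elements that contain a sectPr.
// These indexes mark the section boundaries DocumentBuilder would split at.
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static int[] FindSectionBoundaries(string fileName)
{
    using (WordprocessingDocument doc = WordprocessingDocument.Open(fileName, false))
    {
        var body = doc.MainDocumentPart.Document.Body;
        return body.ChildElements
            .Select((element, index) => new { element, index })
            .Where(x => x.element.Descendants<SectionProperties>().Any())
            .Select(x => x.index)
            .ToArray();
    }
}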
