I built a small .NET 6 application that replaces values inside a Word document and saves a copy.
Some keys are replaced with the content of other files using an AltChunk.
Using file A, into which I merge AltChunk1, the output works fine.
Using file B with the same AltChunk1, the output produces the error "found unreadable content" when opened with Word.
Using file B and a different AltChunk file (even the same one after I trimmed it) can, in some cases, work.
I don't have any clue what the issue might be.
I tried comparing the files using the Open XML SDK Productivity Tool, however:
File A and File B have a lot of differences, so it is really hard to find anything that would explain this behavior.
They are identical at the place where the AltChunk is put.
I tried comparing the non-working result with what Word creates after a repair, but Word does not keep the AltChunk; it completely merges the content of the AltChunk into File B, which makes any comparison with my non-working result almost impossible.
Here is the code I use:
The first method creates the AltChunk from a file, then calls the methods used to replace "keys" with the wanted value (including the case where the key is split across several runs).
internal static void MergeOutSideDocument(string key, string filePath, IEnumerable<string> outsideDocs)
{
if (string.IsNullOrEmpty(key)) throw new ArgumentException("Cannot replace empty key.");
if (!File.Exists(filePath) || outsideDocs.Any(path => !File.Exists(path))) throw new FileNotFoundException();
using WordprocessingDocument doc = WordprocessingDocument.Open(filePath, true);
List<OpenXmlElement> altChunks = new();
foreach (var outsideDoc in outsideDocs)
{
var existingIds = doc.MainDocumentPart.Document.Body.Descendants<AltChunk>();
string altChunkId = "AltChunkId" + DateTime.Now.Ticks.ToString();
MainDocumentPart mainPart = doc.MainDocumentPart;
AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
AlternativeFormatImportPartType.WordprocessingML, altChunkId);
using (FileStream fileStream = File.Open(outsideDoc, FileMode.Open))
chunk.FeedData(fileStream);
altChunks.Add(new AltChunk()
{
Id = altChunkId
});
inMemoryAltChunkIds.Add(altChunkId);
}
var body = doc.MainDocumentPart.Document.Body;
SetElementForKey(key, altChunks,
body.Descendants<Paragraph>().First(par => par.InnerText.Contains(key)),
body);
}
private static void SetElementForKey(string key, List<OpenXmlElement> replacements, OpenXmlElement el, Body body)
{
List<Run> previousRuns = new();
if (el?.InnerText.Contains(key) != true) return;
for (int i = 0; i <= el.Descendants<Run>().Count(); i++)
{
var innerText = string.Join("", previousRuns.Select(r => r.InnerText));
if (innerText.Contains(key))
{
var usedRuns = GetRequiredRunsForText(previousRuns, key);
var firstRun = usedRuns.First();
MergeRunsWithKey(key, usedRuns, firstRun);
var usedRun = usedRuns.First();
var firstPart = usedRun.InnerText.IndexOf(key) != -1 ? usedRun.InnerText[..usedRun.InnerText.IndexOf(key)] : "";
ReplaceText(key, "", usedRun);
foreach (var replacement in replacements)
el.Parent.InsertAfter(replacement, el);
if (string.IsNullOrEmpty(usedRun.InnerText)) usedRun.Remove();
if (string.IsNullOrEmpty(el.InnerText)) el.Remove();
break;
}
else
{
previousRuns.Add(el.Descendants<Run>().ElementAt(i));
}
}
}
private static void MergeRunsWithKey(string key, List<Run> usedRuns, Run firstRun)
{
while (!usedRuns.First().InnerText.Contains(key))
{
AddText(usedRuns.Skip(1).First().InnerText, firstRun);
usedRuns.Skip(1).First().Remove();
usedRuns.RemoveAt(1);
}
}
private static void AddText(string newText, Run run)
{
Text text = run.Elements<Text>().LastOrDefault();
if (text == null)
{
run.Append(new Text());
text = run.Elements<Text>().Last();
}
text.Text += newText;
if (text.Text.StartsWith(" ") || text.Text.EndsWith(" "))
text.Space = SpaceProcessingModeValues.Preserve;
}
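The GetRequiredRunsForText helper called above is not shown in the question. As a rough illustration of the "key split across several runs" case, here is a minimal sketch that works on plain strings (one string per run) rather than on Run objects. The class and method names are hypothetical, and this is not necessarily how the missing helper works.

```csharp
using System;
using System.Collections.Generic;

public static class RunKeyLocator
{
    // Hypothetical helper: returns the inclusive index range of the run texts
    // that together contain the first occurrence of `key`, or null when the
    // concatenated text does not contain it.
    public static (int First, int Last)? FindRunsCoveringKey(IReadOnlyList<string> runTexts, string key)
    {
        if (string.IsNullOrEmpty(key))
            throw new ArgumentException("Cannot locate an empty key.", nameof(key));

        string all = string.Concat(runTexts);
        int keyStart = all.IndexOf(key, StringComparison.Ordinal);
        if (keyStart < 0) return null;
        int keyEnd = keyStart + key.Length; // exclusive

        int first = -1, last = -1, offset = 0;
        for (int i = 0; i < runTexts.Count; i++)
        {
            int runStart = offset;
            int runEnd = offset + runTexts[i].Length; // exclusive
            // A run is involved when its character range overlaps the key's range.
            if (runEnd > keyStart && runStart < keyEnd)
            {
                if (first < 0) first = i;
                last = i;
            }
            offset = runEnd;
        }
        return (first, last);
    }
}
```

For the run texts "Dear ", "{{Na", "me", "}} !" and the key "{{Name}}", the returned range covers the second to the fourth run, which are the runs that would then have to be merged before replacing the key.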
What can I do to understand where the problem lies?
I tried replacing some values I don't understand in File B with the ones from File A (the header and footer have rectangles with different gfxdata values, and the version "recovered" by Word used the same values as File A).
I tried a different way of generating the AltChunkIds and storing a global list for the file.
I tried comparing various parts of the documents (File A and B, or File B's result and its recovered version). There are differences, but too many, and none seem to be relevant.
Related
How can one generate docx serial letter in ASP.NET MVC application?
I can fill in a simple docx template with data from DB by using docx with Content Controls and OpenXML library - as suggested for example here.
However, when I try to use this for a serial letter and merge the generated documents into a single output docx (hint here), the resulting serial letter contains only the data of the first entry. For example, when I generated letters for 10 employees and fed in their data, the output contained 10 letters, but all with the data of the first employee.
Edit: (sample code added)
internal static Stream CreateMultiPartDocument(IList<object> data, string templatePath)
{
Stream mainDocumentStream = CreateTempDocument(data[0], templatePath);
for(int i = 1; i < data.Count; i++)
{
object childDocumentData = data[i];
Stream childDocumentStream = CreateTempDocument(childDocumentData, templatePath);
AppendChildDocument(mainDocumentStream, childDocumentStream);
}
mainDocumentStream.Flush();
mainDocumentStream.Position = 0;
return mainDocumentStream;
}
internal static Stream CreateTempDocument(object data, string templatePath)
{
string fullTemplatePath = Path.Combine(TEMPLATE_BASE_PATH, templatePath);
FileStream templateFile = File.Open(fullTemplatePath, FileMode.Open);
if(null != templateFile)
{
MemoryStream fileInMemory = new MemoryStream();
templateFile.CopyTo(fileInMemory);
string customXML = data.SerializeToXml();
ReplaceCustomXmlInMemory(fileInMemory, customXML);
fileInMemory.Flush();
fileInMemory.Position = 0;
templateFile.Close();
return fileInMemory;
}
return null;
}
private static void ReplaceCustomXmlInMemory(MemoryStream fileInMemory, string customXML)
{
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(fileInMemory, true))
{
MainDocumentPart mainPart = wordDoc.MainDocumentPart;
mainPart.DeleteParts(mainPart.CustomXmlParts);
CustomXmlPart customXmlPart = mainPart.AddCustomXmlPart(CustomXmlPartType.CustomXml);
using (StreamWriter streamWriter = new StreamWriter(customXmlPart.GetStream()))
{
streamWriter.Write(customXML);
}
wordDoc.Close();
}
}
The only solution I was able to find is to switch from OpenXml to Syncfusion's FileFormats library - they support mail merge functionality from scratch, with multiple formats of input possible. See link here.
It is available also as a Nuget package, so it was super simple.
I'm really having trouble editing bookmarks in a Word template using DocumentFormat.OpenXml and then saving the result to a new PDF file.
I cannot use Microsoft.Office.Interop.Word as it gives a COM error on the server.
My code is this:
public static void CreateWordDoc(string templatePath, string destinationPath, Dictionary<string, dynamic> dictionary)
{
byte[] byteArray = File.ReadAllBytes(templatePath);
using (MemoryStream stream = new MemoryStream())
{
stream.Write(byteArray, 0, (int)byteArray.Length);
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(stream, true))
{
var bookmarks = (from bm in wordDoc.MainDocumentPart.Document.Body.Descendants<BookmarkStart>()
select bm).ToList();
foreach (BookmarkStart mark in bookmarks)
{
if (mark.Name != "Table" && mark.Name != "_GoBack")
{
UpdateBookmark(dictionary, mark);//Not doing anything
}
else if (mark.Name == "Table")
{
// CreateTable(dictionary, wordDoc, mark);
}
}
File.WriteAllBytes("D:\\RohitDocs\\newfile_rohitsingh.docx", stream.ToArray());
wordDoc.Close();
}
// Save the file with the new name
}
}
private static void UpdateBookmark(Dictionary<string, dynamic> dictionary, BookmarkStart mark)
{
string name = mark.Name;
string value = dictionary[name];
Run run = new Run(new Text(value));
RunProperties props = new RunProperties();
props.AppendChild(new FontSize() { Val = "20" });
run.RunProperties = props;
var paragraph = new DocumentFormat.OpenXml.Wordprocessing.Paragraph(run);
mark.Parent.InsertAfterSelf(paragraph);
paragraph.PreviousSibling().Remove();
mark.Remove();
}
I was trying to replace the bookmarks with my text, but the UpdateBookmark method doesn't work. I'm writing the stream out and saving it because I thought that, once the bookmarks are replaced, I could save it to another file.
I think you want to make sure that when you reference mark.Parent you are getting the instance you expect.
Once you get a reference to the correct Paragraph element where your content should go, use the following code to add/swap the run.
// assuming you have a reference to a paragraph called "p"
p.AppendChild<Run>(new Run(new Text(content)) { RunProperties = props });
// and here is some code to remove a run
p.RemoveChild<Run>(run);
To answer the second part of your question: when I did a similar project a few years ago, we used iTextSharp to create PDFs from Docx. It worked very well and the API was easy to grok. We even added password encryption and embedded watermarks to the PDFs.
I am currently working on a program in which a user should be able to merge several Word documents into one, without losing any formatting, headers and so on. The documents should simply stack up, one after another, without any changes.
Here is my current code:
public virtual Byte[] MergeWordFiles(IEnumerable<SendData> sourceFiles)
{
int f = 0;
// If only one Word document then skip merge.
if (sourceFiles.Count() == 1)
{
return sourceFiles.First().File;
}
else
{
MemoryStream destinationFile = new MemoryStream();
// Add first file
var firstFile = sourceFiles.First().File;
destinationFile.Write(firstFile, 0, firstFile.Length);
destinationFile.Position = 0;
int pointer = 1;
byte[] ret;
// Add the rest of the files
try
{
using (WordprocessingDocument mainDocument = WordprocessingDocument.Open(destinationFile, true))
{
XElement newBody = XElement.Parse(mainDocument.MainDocumentPart.Document.Body.OuterXml);
for (pointer = 1; pointer < sourceFiles.Count(); pointer++)
{
WordprocessingDocument tempDocument = WordprocessingDocument.Open(new MemoryStream(sourceFiles.ElementAt(pointer).File), true);
XElement tempBody = XElement.Parse(tempDocument.MainDocumentPart.Document.Body.OuterXml);
newBody.Add(XElement.Parse(new DocumentFormat.OpenXml.Wordprocessing.Paragraph(new Run(new Break { Type = BreakValues.Page })).OuterXml));
newBody.Add(tempBody);
mainDocument.MainDocumentPart.Document.Body = new Body(newBody.ToString());
mainDocument.MainDocumentPart.Document.Save();
mainDocument.Package.Flush();
}
}
}
catch (OpenXmlPackageException oxmle)
{
throw new Exception(string.Format(CultureInfo.CurrentCulture, "Error while merging files. Document index {0}", pointer), oxmle);
}
catch (Exception e)
{
throw new Exception(string.Format(CultureInfo.CurrentCulture, "Error while merging files. Document index {0}", pointer), e);
}
finally
{
ret = destinationFile.ToArray();
destinationFile.Close();
destinationFile.Dispose();
}
return ret;
}
}
The problem here is that the formatting is copied from the first document and applied to all the rest, meaning that for instance a different header in the second document will be ignored. How do I prevent this?
I have been looking into breaking the document into sections using SectionMarkValues.NextPage, as well as using altChunk.
The problem with the latter is that altChunk does not seem to be able to take a MemoryStream in its "FeedData" method.
DocIO is a .NET library that can read, write, merge and render Word 2003/2007/2010/2013/2016 files. The whole suite of controls is available for free (commercial applications also) through the community license program if you qualify. The community license is the full product with no limitations or watermarks.
Step 1: Create a console application
Step 2: Add references to Syncfusion.DocIO.Base, Syncfusion.Compression.Base and Syncfusion.OfficeChart.Base. You can also add these references to your project using NuGet.
Step 3: Copy & paste the following code snippet.
This code snippet will produce the document as per your requirement; each input Word document will get merged with its original formatting, styles and headers/footer.
using Syncfusion.DocIO.DLS;
using Syncfusion.DocIO;
using System.IO;
namespace DocIO_MergeDocument
{
class Program
{
static void Main(string[] args)
{
//Boolean to indicate whether any of the input document has different odd and even headers as true
bool isDifferentOddAndEvenPagesEnabled = false;
// Creating a new document.
using (WordDocument mergedDocument = new WordDocument())
{
//Get the files from input directory
DirectoryInfo dirInfo = new DirectoryInfo(System.Environment.CurrentDirectory + @"\..\..\Data");
FileInfo[] fileInfo = dirInfo.GetFiles();
for (int i = 0; i < fileInfo.Length; i++)
{
if (fileInfo[i].Extension == ".doc" || fileInfo[i].Extension == ".docx")
{
using (WordDocument sourceDocument = new WordDocument(fileInfo[i].FullName))
{
//Check whether the document has different odd and even header/footer
if (!isDifferentOddAndEvenPagesEnabled)
{
foreach (WSection section in sourceDocument.Sections)
{
isDifferentOddAndEvenPagesEnabled = section.PageSetup.DifferentOddAndEvenPages;
if (isDifferentOddAndEvenPagesEnabled)
break;
}
}
//Sets the break code of the first section of the source document to NoBreak so that it is not imported on a new page
sourceDocument.Sections[0].BreakCode = SectionBreakCode.NoBreak;
//Imports the contents of source document at the end of merged document
mergedDocument.ImportContent(sourceDocument, ImportOptions.KeepSourceFormatting);
}
}
}
//If any of the input documents has different odd and even headers enabled, then
//copy the content of the odd header/footer and add the copied content into the even header/footer
if (isDifferentOddAndEvenPagesEnabled)
{
foreach (WSection section in mergedDocument.Sections)
{
section.PageSetup.DifferentOddAndEvenPages = true;
if (section.HeadersFooters.OddHeader.Count > 0 && section.HeadersFooters.EvenHeader.Count == 0)
{
for (int i = 0; i < section.HeadersFooters.OddHeader.Count; i++)
section.HeadersFooters.EvenHeader.ChildEntities.Add(section.HeadersFooters.OddHeader.ChildEntities[i].Clone());
}
if (section.HeadersFooters.OddFooter.Count > 0 && section.HeadersFooters.EvenFooter.Count == 0)
{
for (int i = 0; i < section.HeadersFooters.OddFooter.Count; i++)
section.HeadersFooters.EvenFooter.ChildEntities.Add(section.HeadersFooters.OddFooter.ChildEntities[i].Clone());
}
}
}
//If there is no document to merge then add empty section with empty paragraph
if (mergedDocument.Sections.Count == 0)
mergedDocument.EnsureMinimal();
//Saves the document in the given name and format
mergedDocument.Save("result.docx", FormatType.Docx);
}
}
}
}
Downloadable Demo
Note: Word has a document-level (not section-level) setting for
applying different headers/footers to odd and even pages. Each input
document can have a different value for this setting. If any input
document has it enabled, it affects how headers/footers appear in the
resultant document. Hence, in that case, the code above copies the
odd header/footer contents into the empty even headers/footers of the
resultant document.
For further information about DocIO, please refer to our help documentation
Note: I work for Syncfusion
I'm developing a web application with ASP.NET and I have a file called Template.docx that works as a template to generate other reports. Inside this Template.docx I have some MergeFields (Title, CustomerName, Content, Footer, etc.) to be replaced with dynamic content from C#.
I would like to know how I can put content into a merge field in a docx.
I don't know if MergeFields are the right way to do this or if there is another way. Any suggestions are appreciated!
PS: I have openxml referenced in my web application.
Edits:
private MemoryStream LoadFileIntoStream(string fileName)
{
MemoryStream memoryStream = new MemoryStream();
using (FileStream fileStream = File.OpenRead(fileName))
{
memoryStream.SetLength(fileStream.Length);
fileStream.Read(memoryStream.GetBuffer(), 0, (int) fileStream.Length);
memoryStream.Flush();
fileStream.Close();
}
return memoryStream;
}
public MemoryStream GenerateWord()
{
string templateDoc = "C:\\temp\\template.docx";
string reportFileName = "C:\\temp\\result.docx";
var reportStream = LoadFileIntoStream(templateDoc);
// Copy a new file name from template file
//File.Copy(templateDoc, reportFileName, true);
// Open the new Package
Package pkg = Package.Open(reportStream, FileMode.Open, FileAccess.ReadWrite);
// Specify the URI of the part to be read
Uri uri = new Uri("/word/document.xml", UriKind.Relative);
PackagePart part = pkg.GetPart(uri);
XmlDocument xmlMainXMLDoc = new XmlDocument();
xmlMainXMLDoc.Load(part.GetStream(FileMode.Open, FileAccess.Read));
// replace some keys inside xml (it will come from database, it's just a test)
xmlMainXMLDoc.InnerXml = xmlMainXMLDoc.InnerXml.Replace("field_customer", "My Customer Name");
xmlMainXMLDoc.InnerXml = xmlMainXMLDoc.InnerXml.Replace("field_title", "Report of Documents");
xmlMainXMLDoc.InnerXml = xmlMainXMLDoc.InnerXml.Replace("field_content", "Content of Document");
// Open the stream to write document
StreamWriter partWrt = new StreamWriter(part.GetStream(FileMode.Open, FileAccess.Write));
//doc.Save(partWrt);
xmlMainXMLDoc.Save(partWrt);
partWrt.Flush();
partWrt.Close();
reportStream.Flush();
pkg.Close();
return reportStream;
}
PS: When I convert MemoryStream to a file, I got a corrupted file. Thanks!
I know this is an old post, but I could not get the accepted answer to work for me. The project linked would not even compile (which someone has already commented in that link). Also, it seems to use other Nuget packages like WPFToolkit.
So I'm adding my answer here in case someone finds it useful. This only uses the OpenXML SDK 2.5 and also the WindowsBase v4. This works on MS Word 2010 and later.
string sourceFile = @"C:\Template.docx";
string targetFile = @"C:\Result.docx";
File.Copy(sourceFile, targetFile, true);
using (WordprocessingDocument document = WordprocessingDocument.Open(targetFile, true))
{
// If your sourceFile is a different type (e.g., .DOTX), you will need to change the target type like so:
document.ChangeDocumentType(WordprocessingDocumentType.Document);
// Get the MainPart of the document
MainDocumentPart mainPart = document.MainDocumentPart;
var mergeFields = mainPart.RootElement.Descendants<FieldCode>();
var mergeFieldName = "SenderFullName";
var replacementText = "John Smith";
ReplaceMergeFieldWithText(mergeFields, mergeFieldName, replacementText);
// Save the document
mainPart.Document.Save();
}
private void ReplaceMergeFieldWithText(IEnumerable<FieldCode> fields, string mergeFieldName, string replacementText)
{
var field = fields
.Where(f => f.InnerText.Contains(mergeFieldName))
.FirstOrDefault();
if (field != null)
{
// Get the Run that contains our FieldCode
// Then get the parent container of this Run
Run rFldCode = (Run)field.Parent;
// Get the three (3) other Runs that make up our merge field
Run rBegin = rFldCode.PreviousSibling<Run>();
Run rSep = rFldCode.NextSibling<Run>();
Run rText = rSep.NextSibling<Run>();
Run rEnd = rText.NextSibling<Run>();
// Get the Run that holds the Text element for our merge field
// Get the Text element and replace the text content
Text t = rText.GetFirstChild<Text>();
t.Text = replacementText;
// Remove all the four (4) Runs for our merge field
rFldCode.Remove();
rBegin.Remove();
rSep.Remove();
rEnd.Remove();
}
}
What the code above does is basically this:
Identify the 4 Runs that make up the merge field named "SenderFullName".
Identify the Run that contains the Text element for our merge field.
Remove the 4 Runs.
Update the text property of the Text element for our merge field.
UPDATE
For anyone interested, here is a simple static class I used to help me with replacing merge fields.
Frank Fajardo's answer was 99% of the way there for me, but it is important to note that MERGEFIELDS can be SimpleFields or FieldCodes.
In the case of SimpleFields, the text runs displayed to the user in the document are children of the SimpleField.
In the case of FieldCodes, the text runs shown to the user are between the runs containing FieldChars with the Separate and the End FieldCharValues. Occasionally, several text containing runs exist between the Separate and End Elements.
The code below deals with these problems. Further details of how to get all the MERGEFIELDS from the document, including the header and footer is available in a GitHub repository at https://github.com/mcshaz/SimPlanner/blob/master/SP.DTOs/Utilities/OpenXmlExtensions.cs
private static Run CreateSimpleTextRun(string text)
{
Run returnVar = new Run();
RunProperties runProp = new RunProperties();
runProp.Append(new NoProof());
returnVar.Append(runProp);
returnVar.Append(new Text() { Text = text });
return returnVar;
}
private static void InsertMergeFieldText(OpenXmlElement field, string replacementText)
{
var sf = field as SimpleField;
if (sf != null)
{
var textChildren = sf.Descendants<Text>();
textChildren.First().Text = replacementText;
foreach (var others in textChildren.Skip(1))
{
others.Remove();
}
}
else
{
var runs = GetAssociatedRuns((FieldCode)field);
var rEnd = runs[runs.Count - 1];
foreach (var r in runs
.SkipWhile(r => !r.ContainsCharType(FieldCharValues.Separate))
.Skip(1)
.TakeWhile(r=>r!= rEnd))
{
r.Remove();
}
rEnd.InsertBeforeSelf(CreateSimpleTextRun(replacementText));
}
}
private static IList<Run> GetAssociatedRuns(FieldCode fieldCode)
{
Run rFieldCode = (Run)fieldCode.Parent;
Run rBegin = rFieldCode.PreviousSibling<Run>();
Run rCurrent = rFieldCode.NextSibling<Run>();
var runs = new List<Run>(new[] { rBegin, rCurrent });
while (!rCurrent.ContainsCharType(FieldCharValues.End))
{
rCurrent = rCurrent.NextSibling<Run>();
runs.Add(rCurrent);
};
return runs;
}
private static bool ContainsCharType(this Run run, FieldCharValues fieldCharType)
{
var fc = run.GetFirstChild<FieldChar>();
return fc == null
? false
: fc.FieldCharType.Value == fieldCharType;
}
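The run layout described in this answer (a "begin" field character, the field code, a "separate" field character, one or more result runs, then an "end" field character) can also be checked on the raw XML without the SDK. The following is a minimal LINQ to XML sketch, assuming a single non-nested field per paragraph; the class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FieldResultFinder
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the runs that sit between the first "separate" field character
    // and the following "end" field character of a w:p element.
    // Assumption: the paragraph holds one field and fields are not nested.
    public static List<XElement> FindResultRuns(XElement paragraph)
    {
        var result = new List<XElement>();
        bool inResult = false;
        foreach (XElement run in paragraph.Elements(W + "r"))
        {
            string type = (string)run.Element(W + "fldChar")?.Attribute(W + "fldCharType");
            if (type == "separate") { inResult = true; continue; }
            if (type == "end")
            {
                if (inResult) break; // end of the field result
                continue;
            }
            if (inResult) result.Add(run);
        }
        return result;
    }
}
```

If this returns more than one run for a merge field, that is the "several text-containing runs between Separate and End" situation mentioned above.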
You could try http://www.codeproject.com/KB/office/Fill_Mergefields.aspx which uses the Open XML SDK to do this.
I am trying the following code. It takes a fileName (docx file with many sections) and I try to iterate through each section getting the section name. The problem is that I end up with unreadable docx files. It does not error, but I think I am doing something wrong with getting the elements in the section.
public void Split(string fileName) {
using (WordprocessingDocument myDoc =
WordprocessingDocument.Open(fileName, true)) {
string curCliCode = "";
MainDocumentPart mdp = myDoc.MainDocumentPart;
foreach (var element in mdp.Document.Body.ChildElements) {
if (element.Descendants().OfType<SectionProperties>().Count() == 1) {
//get the name of the section from the footer
var footer = (FooterPart) mdp.GetPartById(
element.Descendants().OfType<SectionProperties>().First().OfType
<FooterReference>().First().
Id.Value);
foreach (Paragraph p in footer.Footer.ChildElements.OfType<Paragraph>()) {
if (p.InnerText != "") {
curCliCode = p.InnerText;
}
}
if (curCliCode != "") {
var forFile = new List<OpenXmlElement>();
var els = element.ElementsBefore();
if (els != null) {
foreach (var e in els) {
if (e != null) {
forFile.Add(e);
}
}
for (int i = 0; i < els.Count(); i++) {
els.ElementAt(i).Remove();
}
}
Create(curCliCode, forFile);
}
}
}
}
}
private void Create(string cliCode,IEnumerable<OpenXmlElement> docParts) {
var parts = from e in docParts select e.Clone();
const string template = @"\Test\toSplit\blank.docx";
string destination = string.Format(@"\Test\{0}.docx", cliCode);
File.Copy(template, destination,true);
/* Create the package and main document part */
using (WordprocessingDocument myDoc =
WordprocessingDocument.Open(destination, true)) {
MainDocumentPart mainPart = myDoc.MainDocumentPart;
/* Create the contents */
foreach(var part in parts) {
mainPart.Document.Body.Append((OpenXmlElement)part);
}
/* Save the results and close */
mainPart.Document.Save();
myDoc.Close();
}
}
Does anyone know what the problem could be (or how to properly copy a section from one document to another)?
I've done some work in this area, and what I have found invaluable is diffing a known good file with a prospective file; the error is usually fairly obvious.
What I would do is take a file that you know works, and copy all of the sections into the template. Theoretically, the two files should be identical. Run a diff on the document.xml inside each docx file, and you'll see the difference.
BTW, I'm assuming that you know that the docx is actually a zip; change the extension to "zip", and you'll be able to get at the actual xml files which compose the format.
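If renaming and unzipping by hand gets tedious, the same thing can be done in code. This is a minimal sketch using only System.IO.Compression and LINQ to XML; the class and method names are hypothetical. It pulls word/document.xml out of the docx and returns it indented, so that a line-based diff tool has something reasonable to compare.

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public static class DocxInspector
{
    // Reads word/document.xml out of a .docx (which is a ZIP archive) and
    // returns it as indented XML text suitable for a line-based diff.
    public static string ReadMainDocumentXml(string docxPath)
    {
        using ZipArchive archive = ZipFile.OpenRead(docxPath);
        ZipArchiveEntry entry = archive.GetEntry("word/document.xml")
            ?? throw new FileNotFoundException("word/document.xml not found in " + docxPath);
        using Stream stream = entry.Open();
        XDocument xml = XDocument.Load(stream);
        return xml.ToString(); // XDocument.ToString() indents by default
    }
}
```

Writing the result for the working file and for the broken file to two text files, and then diffing those, avoids the problem that document.xml is normally stored as one very long line.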
As far as diff tools, I use Beyond Compare from Scooter Software.
An approach along the lines of what you are doing will work only for simple documents (i.e., those not containing images, hyperlinks, comments, etc.). To handle these more complex documents, take a look at http://blogs.msdn.com/b/ericwhite/archive/2009/02/05/move-insert-delete-paragraphs-in-word-processing-documents-using-the-open-xml-sdk.aspx and the resulting DocumentBuilder API (part of the PowerTools for Open XML project on CodePlex).
In order to split a docx into sections using DocumentBuilder, you'll still need to first find the index of the paragraphs containing sectPr elements.
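Finding those indexes could look roughly like the sketch below. It uses LINQ to XML on the w:body element rather than DocumentBuilder itself, and the class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SectionFinder
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the zero-based positions, among the child elements of w:body,
    // of the paragraphs whose w:pPr holds a w:sectPr. Each such paragraph
    // is the last paragraph of a section.
    public static List<int> FindSectionEndIndexes(XElement body)
    {
        return body.Elements()
            .Select((element, index) => (element, index))
            .Where(x => x.element.Name == W + "p"
                     && x.element.Element(W + "pPr")?.Element(W + "sectPr") != null)
            .Select(x => x.index)
            .ToList();
    }
}
```

Note that the properties of the last section are normally stored in a w:sectPr that is a direct child of w:body rather than inside a paragraph, so this sketch does not report it. The last section simply runs from the final reported index to the end of the body.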