Word Interop Join documents in memory

I have a List<Word.Document> that I would like to join into a single Word.Document. Below is all that I have so far.
Any ideas?
public static Word.Document JoinDocuments(List<Word.Document> DocstoJoin)
{
Word.Document JoinedDoc = new Word.Document();
foreach (Word.Document doc in DocstoJoin)
{
foreach (Word.Section sec in doc.Sections)
{
**????**
}
}
return JoinedDoc;
}

The Selection class provides the following methods that can be used to get the job done:
Copy - copies the specified selection to the Clipboard.
Paste - inserts the contents of the Clipboard at the specified selection.
Also you may consider using the Open XML SDK if you deal with open XML documents only.

Related

OpenXML SDK: How to get "Tag" (p:tag) val of a PowerPoint shape?

I need to check all tags on all shapes on all slides. I can select each shape, however I can't see how to get the shape's tags.
For the given DocumentFormat.OpenXml.Presentation.Shape, how can I get the "val" of the tag with name="MOUNTAIN"
In my shape, the tag rId is in this structure: p:sp > p:nvSpPr > p:nvPr > p:custDataLst > p:tags
I'm guessing my code needs to do these steps:
• Get the rId of the p:custDataLst p:tags
• Look up the "Target" file name in the slideX.xml.rels file, based on the rId
• Look in the root/tags folder for the "Target" file
• Get the p:tagLst p:tags and look for the p:tag with name="MOUNTAIN"
<p:tagLst>
<p:tag name="MOUNTAIN" val="Denali"/>
</p:tagLst>
Here is how my code iterates through shapes on each slide:
for (int x = 0; x < doc.PresentationPart.SlideParts.Count(); x++)
{
SlidePart slide = doc.PresentationPart.SlideParts.ElementAt(x);
ShapeTree tree = slide.Slide.CommonSlideData.ShapeTree;
IEnumerable<DocumentFormat.OpenXml.Presentation.Shape> slShapes = slide.Slide.Descendants<DocumentFormat.OpenXml.Presentation.Shape>();
foreach (DocumentFormat.OpenXml.Presentation.Shape shape in slShapes)
{
//get the specified tag, if it exists
}
}
I see an example of how to add tags: How to add custom tags to powerpoint slides using OpenXml in c#
But I can't figure out how to read the existing tags.
So, how do I get the shape's tags with c#?
I was hoping to do something like this:
IEnumerable<UserDefinedTagsPart> userDefinedTagsParts = shape.NonVisualShapeProperties.ApplicationNonVisualDrawingProperties.CustomerDataList.CustomerDataTags<UserDefinedTagsPart>();
foreach (UserDefinedTagsPart userDefinedTagsPart in userDefinedTagsParts)
{}
but Visual Studio says "ApplicationNonVisualDrawingProperties does not contain a definition for CustomerDataList".
From the OpenXML Productivity Tool, here is the element tree:

You and I seem to be working on similar problems; I'm still struggling to learn the file format. The following code works for me, though I'm sure it can be optimized.
public void ReadTags(Shape shape, SlidePart slidePart)
{
NonVisualShapeProperties nvsp = shape.NonVisualShapeProperties;
ApplicationNonVisualDrawingProperties nvdp = nvsp.ApplicationNonVisualDrawingProperties;
IEnumerable<CustomerDataTags> data_tags = nvdp.Descendants<CustomerDataTags>();
foreach (var data_tag in data_tags)
{
UserDefinedTagsPart shape_tags = slidePart.GetPartById(data_tag.Id) as UserDefinedTagsPart;
if (shape_tags != null)
{
foreach (Tag tag in shape_tags.TagList)
{
Debug.Print($"\t{nvsp.NonVisualDrawingProperties.Name} tag {tag.Name} = '{tag.Val}'");
}
}
}
}
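For comparison, the manual lookup outlined in the question (r:id on p:tags, then the slide's .rels file, then the tags part, then the p:tag) can be sketched without the SDK, using only System.IO.Compression and System.Xml.Linq. This is a rough sketch, not a tested production reader: the class and method names are hypothetical, and it assumes the usual package layout (slides under ppt/slides, relationship targets of the form "../tags/tag1.xml").

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class PptxTagReader
{
    static readonly XNamespace P = "http://schemas.openxmlformats.org/presentationml/2006/main";
    static readonly XNamespace R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
    static readonly XNamespace Rel = "http://schemas.openxmlformats.org/package/2006/relationships";

    // Loads one XML part from the package, or returns null if it is missing.
    static XDocument LoadPart(ZipArchive zip, string path)
    {
        ZipArchiveEntry entry = zip.GetEntry(path);
        if (entry == null) return null;
        using (Stream stream = entry.Open())
        {
            return XDocument.Load(stream);
        }
    }

    // Returns the val of the first tag named tagName found on any shape
    // of the given slide (e.g. "slide1.xml"), or null if there is none.
    public static string FindTagValue(ZipArchive zip, string slideFile, string tagName)
    {
        XDocument slide = LoadPart(zip, "ppt/slides/" + slideFile);
        XDocument rels = LoadPart(zip, "ppt/slides/_rels/" + slideFile + ".rels");
        if (slide == null || rels == null) return null;

        var tagRefs = slide.Descendants(P + "sp")
                           .Descendants(P + "custDataLst")
                           .Elements(P + "tags");
        foreach (XElement tagRef in tagRefs)
        {
            // Step 1: the r:id on p:tags. Step 2: its Target in the .rels file.
            string rId = (string)tagRef.Attribute(R + "id");
            string target = rels.Root.Elements(Rel + "Relationship")
                .Where(rel => (string)rel.Attribute("Id") == rId)
                .Select(rel => (string)rel.Attribute("Target"))
                .FirstOrDefault();
            if (target == null) continue;

            // Assumption: targets are relative to ppt/slides, e.g. "../tags/tag1.xml".
            string path = target.StartsWith("../", StringComparison.Ordinal)
                ? "ppt/" + target.Substring(3)
                : "ppt/slides/" + target;

            // Steps 3 and 4: open the tags part and look for the named p:tag.
            XDocument tagList = LoadPart(zip, path);
            XElement tag = tagList?.Descendants(P + "tag")
                .FirstOrDefault(t => (string)t.Attribute("name") == tagName);
            if (tag != null) return (string)tag.Attribute("val");
        }
        return null;
    }
}
```

A call might look like PptxTagReader.FindTagValue(zip, "slide1.xml", "MOUNTAIN"), where zip is a ZipArchive opened on the .pptx file. The SDK version above is usually the better choice, since GetPartById resolves relationships properly; the sketch mainly shows what that call does under the hood.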

I've spent a lot of time with OpenXML .docx and .xlsx files ... but not so much with .pptx.
Nevertheless, here are a couple of suggestions that might help:
If you haven't already done so, please download the OpenXML SDK Productivity Tool to analyze your file's contents. It's currently available on GitHub:
https://github.com/dotnet/Open-XML-SDK/releases/tag/v2.5
You might simply be able to "grep" for items you're looking for.
EXAMPLE (Word, not PowerPoint... but the same principle should apply):
using (doc = WordprocessingDocument.Open(stream, true))
{
// Init OpenXML members
mainPart = doc.MainDocumentPart;
body = mainPart.Document.Body;
...
foreach (var text in body.Descendants<Text>())
{
if (text.Text.Contains(target))
...

How to solve the error of Word opening in background when trying to read text from Word documents?

I'm trying to read the text from Word documents into a List<string>, and then search for a word in that text. The problem, however, is that the Word processes keep running in the Windows background after a document is opened, even though I close the document after reading the text.
Parallel.ForEach(files, file =>
{
switch (System.IO.Path.GetExtension(file))
{
case ".docx":
List<string> Word_list = GetTextFromWord(file);
SearchForWordContent(Word_list, file);
break;
}
});
static List<string> GetTextFromWord(string direct)
{
if (string.IsNullOrEmpty(direct))
{
throw new ArgumentNullException("direct");
}
if (!File.Exists(direct))
{
throw new FileNotFoundException("direct");
}
List<string> word_List = new List<string>();
try
{
Microsoft.Office.Interop.Word.Application app =
new Microsoft.Office.Interop.Word.Application();
Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(direct);
int count = doc.Words.Count;
for (int i = 1; i <= count; i++)
{
word_List.Add(doc.Words[i].Text);
}
((_Application)app).Quit();
}
catch (System.Runtime.InteropServices.COMException e)
{
Console.WriteLine("Error: " + e.Message.ToString());
}
return word_List;
}

When you use Word Interop you're actually starting the Word application and talking to it using COM. Every call, even reading a property, is an expensive cross-process call.
You can read a Word document without using Word. A docx document is a ZIP package containing well-defined XML files. You could read those files as XML directly, use the Open XML SDK to read a docx file, or use a library like NPOI, which simplifies working with Open XML.
The word count is a property of the document itself. To read it using the Open XML SDK you need to check the document's ExtendedFileProperties part:
using (var document = WordprocessingDocument.Open(fileName, false))
{
var words = int.Parse(document.ExtendedFilePropertiesPart.Properties.Words.Text);
}
You'll find the Open XML documentation, including the structure of Word documents, at MSDN.
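The "read those files as XML directly" route can be sketched with nothing but the base class library. The following is a minimal sketch, not a full reader: DocxText and ReadParagraphs are hypothetical names, and it assumes the main part lives at word/document.xml, which is the usual location but not guaranteed by the format. No Word process is started, so nothing is left running in the background.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxText
{
    static readonly XNamespace W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each paragraph (w:p) in the main document part.
    public static List<string> ReadParagraphs(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main part is stored at the conventional path.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null) return new List<string>();

            using (Stream part = entry.Open())
            {
                return XDocument.Load(part)
                    .Descendants(W + "p")
                    // A paragraph's text is the concatenation of its w:t runs.
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                    .ToList();
            }
        }
    }
}
```

With a file on disk this could be called as using (var fs = File.OpenRead(file)) { var paragraphs = DocxText.ReadParagraphs(fs); } and the resulting list searched with ordinary string methods. Because each call only reads a file, it should also be safe to use from Parallel.ForEach.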
Avoiding Owner Files
Word or Excel files that start with ~ are owner files. These aren't real Word or Excel files. They're temporary files created when someone opens a document for editing, and they contain the logon name of that user. These files are deleted when Word closes gracefully but may be left behind if Word crashes or the user has no DELETE permissions, e.g. in a shared folder.
To avoid these, one only needs to check whether the filename starts with ~.
If fileName is only the file name and extension, fileName.StartsWith("~") is enough.
If fileName is an absolute path, use Path.GetFileName(fileName).StartsWith("~").
Things get trickier when trying to filter such files in a folder. The patterns used in Directory.EnumerateFiles or DirectoryInfo.EnumerateFiles are simplistic and can't exclude characters. The files will have to be filtered after the call to EnumerateFiles, e.g.:
var dir=new DirectoryInfo(folderPath);
foreach(var file in dir.EnumerateFiles("*.docx"))
{
if (!file.Name.StartsWith("~"))
{
...
}
}
or, using LINQ :
var dir=new DirectoryInfo(folderPath);
var files=dir.EnumerateFiles("*.docx")
.Where(file=>!file.Name.StartsWith("~"));
foreach(var file in files)
{
...
}
Enumeration can still fail if a file is opened exclusively for editing. To avoid exceptions, the EnumerationOptions.IgnoreInaccessible property can be set to skip over such files:
var dir=new DirectoryInfo(folderPath);
var options=new EnumerationOptions
{
IgnoreInaccessible =true
};
var files=dir.EnumerateFiles("*.docx",options)
.Where(file=>!file.Name.StartsWith("~"));

Replace text in docx file with content of another docx file

I'm trying to use OpenXml to replace the text "Veteran" in file A.docx with the content of B.docx. If B.docx contains only text or paragraphs, it works fine and I get a modified A.docx file.
However, if B.docx contains a table, then the code doesn't work.
static void Main(string[] args)
{
SearchAndReplace(@"C:\A.docx", @"C:\B.docx");
}
public static void SearchAndReplace(string docTo, string docFrom)
{
List<WordprocessingDocument> docList = new List<WordprocessingDocument>();
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(docTo, true))
using (WordprocessingDocument wordDoc1 = WordprocessingDocument.Open(docFrom, true))
{
var parts = wordDoc1.MainDocumentPart.Document.Descendants().FirstOrDefault();
docList.Add(wordDoc);
docList.Add(wordDoc1);
if (parts != null)
{
foreach (var node in parts.ChildElements)
{
if (node is Table)
{
ParseTable(docList, (Table)node, textBuilder);
}
}
}
}
}
public static void ParseText(List<WordprocessingDocument> wpd, Paragraph node, StringBuilder textBuilder)
{
Body body = wpd[0].MainDocumentPart.Document.Body;
Body body1 = wpd[1].MainDocumentPart.Document.Body;
string content = body1.InnerXml;
var paras = body.Elements<Paragraph>();
foreach (var para in paras)
{
foreach (var run in para.Elements<Run>())
{
foreach (var text in run.Elements<Text>())
{
if (text.Text.Contains("Veteran"))
{
run.InnerXml.Replace(run.InnerXml, content);
break;
}
}
}
}
}
public static void ParseTable(List<WordprocessingDocument> wpd, Table node, StringBuilder textBuilder)
{
foreach (var row in node.Descendants<TableRow>())
{
textBuilder.Append("| ");
foreach (var cell in row.Descendants<TableCell>())
{
foreach (var para in cell.Descendants<Paragraph>())
{
ParseText(wpd, para, textBuilder);
}
textBuilder.Append(" | ");
}
textBuilder.AppendLine("");
}
}
}
}
How can I make this work? Is there a better way to replace content with the content of another docx file?

Not having enough detail for a specific answer, here's how you solve such problems in general:
Ensure you understand the Open XML specification and valid Open XML markup on an appropriate level of detail.
If you don't understand what w:document, w:body, w:p, w:r, w:t, w:tbl, etc. are and how they relate to each other, you have no chance.
You must look at actual Open XML markup, e.g., using the Open XML Productivity Tool or the Open XML Package Editor for Modern Visual Studios to get to an appropriate level of understanding and develop Open XML-based solutions.
Understand that most Open XML-related code transforms some source markup into some target markup. Therefore, you must:
• understand the source and target markup first and then
• define the transformation required to create the target from the source.
Depending on what you need to do, the Open XML Productivity Tool can help create the transforming code. If you have a source and target document, you can use the Productivity Tool to compare those documents. This shows the difference in the markup, so you see what markup is created, deleted, or changed. It even shows you the Open XML SDK-based code required to effect the change.
In my own use cases, I typically prefer to write recursive, pure functional transformations. While you need to wrap your head around the concept, this is an extremely powerful approach.
In your case, you should:
• take a few representative, manually-created samples of source (A.docx with "Veteran" still to be replaced) and target (A.docx with "Veteran" replaced as desired) documents;
• look at the Open XML markup of the source and target documents; and
• write code that creates the target markup.
Once you have created code that at least tries to create valid target Open XML markup, you could come back with further questions in case you identify further issues.
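To make "source markup to target markup" concrete, here is one possible shape of such a transformation, written against plain System.Xml.Linq elements rather than the SDK. It is only a sketch under stated assumptions: BodyTransform and ReplaceParagraphs are hypothetical names, the placeholder is assumed to sit in a top-level paragraph of the target w:body, and the whole paragraph is replaced. It copies tables (w:tbl) as readily as paragraphs, but it does not carry over styles, numbering, images, or other parts that the copied markup may reference.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class BodyTransform
{
    static readonly XNamespace W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces every top-level paragraph of targetBody whose text contains
    // placeholder with copies of sourceBody's block-level children.
    // Returns the number of paragraphs replaced.
    public static int ReplaceParagraphs(XElement targetBody, string placeholder, XElement sourceBody)
    {
        var hits = targetBody.Elements(W + "p")
            .Where(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value))
                              .Contains(placeholder))
            .ToList(); // snapshot, because the tree is modified below

        foreach (XElement paragraph in hits)
        {
            // Copy paragraphs and tables alike, but skip the source's
            // section properties so the target keeps its own page setup.
            object[] copies = sourceBody.Elements()
                .Where(e => e.Name != W + "sectPr")
                .Select(e => (object)new XElement(e))
                .ToArray();
            paragraph.ReplaceWith(copies);
        }
        return hits.Count;
    }
}
```

The two w:body elements would come from the word/document.xml parts of A.docx and B.docx. Comparing the result with a manually created target document, as suggested above, is the way to find out which of the ignored details matter for your files.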

Reading Word Bookmarks using Open XML and C#

I have read everything I can find that is even remotely related to this (including Read Word bookmarks), but have not been able to get anything to work.
I am trying to walk through a Word document that has bookmarks in it, and get the values for each of the bookmarks. I can walk the document and get the names of the bookmarks, but cannot figure out how to get the value/text of the bookmark.
Here is what I am using to get the bookmark names:
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(athleteFile, false))
{
foreach (BookmarkStart bookmark in wordDocument.MainDocumentPart.Document.Body.Descendants<BookmarkStart>())
{
System.Diagnostics.Debug.WriteLine(bookmark.Name + " - " + bookmark.InnerText);
}
}

First of all, I'd highly recommend you use the Open XML SDK 2.5 Productivity Tool; that way you'll have a better idea of what you're working with.
Secondly, a bookmark in Word does not have any value associated with it. It usually just marks a location in the Word document, so what you're trying to do won't work.
<w:bookmarkStart w:name="bkStart" w:id="0" />
That is the XML element that is created in the docx file when you add a bookmark to the document.
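When a bookmark encloses text, the file also contains a matching w:bookmarkEnd element with the same w:id, and the enclosed text is whatever w:t content lies between the two in document order. A minimal sketch of collecting it with System.Xml.Linq follows; BookmarkReader and GetBookmarkText are hypothetical names, and the XDocument is assumed to be the loaded word/document.xml part.

```csharp
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class BookmarkReader
{
    static readonly XNamespace W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text between the named bookmark's start and end markers,
    // or null if the document has no bookmark with that name.
    public static string GetBookmarkText(XDocument document, string bookmarkName)
    {
        XElement start = document.Descendants(W + "bookmarkStart")
            .FirstOrDefault(e => (string)e.Attribute(W + "name") == bookmarkName);
        if (start == null) return null;

        string id = (string)start.Attribute(W + "id");
        var text = new StringBuilder();
        bool inside = false;

        // Descendants() walks the tree in document order, so the bookmark
        // may span several runs or even several paragraphs.
        foreach (XElement element in document.Descendants())
        {
            if (element == start) { inside = true; continue; }
            if (!inside) continue;

            if (element.Name == W + "bookmarkEnd" && (string)element.Attribute(W + "id") == id)
                break;
            if (element.Name == W + "t")
                text.Append(element.Value);
        }
        return text.ToString();
    }
}
```

For a bookmark that only marks a position, the start and end elements are adjacent and the result is an empty string, which matches the point made above.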

Solution 1:
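Get the bookmark's text by accessing its parent's inner text (note that this returns the text of the whole parent element, typically the paragraph, not only the bookmarked range):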
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(athleteFile, false))
{
foreach (BookmarkStart bookmark in wordDocument.MainDocumentPart.Document.Body.Descendants<BookmarkStart>())
{
// Get name of bookmark
string bookmarkNameOriginal = bookmark.Name;
// Get bookmark text from the parent element's text
string bookmarkText = bookmark.Parent.InnerText;
}
}
Solution 2:
I have found another solution using DocX by Xceed.
Note:
Reading bookmarks is slow in the free version (DocX v1.3). This is fixed in DocX v1.4, but the free version receives that update later.
Import DocX:
using Xceed.Words.NET;
Create method to read bookmark name and text:
/// <summary>
/// Read bookmark text/names in word document
/// </summary>
/// <param name="filePath"></param>
/// <remarks>
/// Uses free DocX by Xceed
/// </remarks>
public void ReadBookmarks(string filePath)
{
//Load document
using (DocX Document = DocX.Load(filePath))
{
//This is slow in the free version (DocX v1.3); fixed in DocX v1.4, which the free version receives later
BookmarkCollection bookmarks = Document.Bookmarks;
//Iterate over bookmarks in document
foreach (Bookmark bookmark in bookmarks) {
//Name of bookmark
string bookmarkName = bookmark.Name;
//Text of bookmark, usually a word heading (1, 2, 3...)
string bookmarkText = bookmark.Paragraph.Text;
}
}
}

Read bookmarks in outlook MSG file with C#

My goal is to somehow be able to read bookmarks in an outlook .msg file, then replace them with a different text. I want to do this with C#.
I know how to access the body and change the text, but was wondering if there is a way to directly access the list of all the bookmarks and their locations so that I can easily replace them, instead of going through the whole body text, splitting it up, etc.
Edit: bookmarks can be assigned from the Bookmark dialog window, so it should be possible to obtain this list via C#.
Any relevant info is appreciated.
Thanks in advance.

Since Outlook most often uses Word as its body editor, you need to add a project reference to Microsoft.Office.Interop.Word.dll and then access the Outlook Inspector's WordEditor during the Inspector.Activate event. Once you have access to the Word.Document, it's trivial to load up the Bookmarks and access or modify their values.
Outlook.Inspector inspector = Globals.ThisAddIn.Application.ActiveInspector();
((Outlook.InspectorEvents_10_Event)inspector).Activate += () =>
{ // validation to ensure we are using Word Editor
if (inspector.EditorType == Outlook.OlEditorType.olEditorWord && inspector.IsWordMail())
{
Word.Document wordDoc = inspector.WordEditor as Word.Document;
if (wordDoc != null)
{
var bookmarks = wordDoc.Bookmarks;
foreach (Word.Bookmark item in bookmarks)
{
string name = item.Name; // bookmark name
Word.Range bookmarkRange = item.Range; // bookmark range
string bookmarkText = bookmarkRange.Text; // bookmark text
item.Select(); // triggers bookmark selection
}
}
}
};
