Reading Word Bookmarks using Open XML and C#


I have read everything I can find that is even remotely related to this (including Read Word bookmarks), but have not been able to get anything to work.
I am trying to walk through a Word document that has bookmarks in it, and get the values for each of the bookmarks. I can walk the document and get the names of the bookmarks, but cannot figure out how to get the value/text of the bookmark.
Here is what I am using to get the bookmark names:
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(athleteFile, false))
{
    foreach (BookmarkStart bookmark in wordDocument.MainDocumentPart.Document.Body.Descendants<BookmarkStart>())
    {
        System.Diagnostics.Debug.WriteLine(bookmark.Name + " - " + bookmark.InnerText);
    }
}

First of all, I'd highly recommend using the Open XML SDK 2.5 Productivity Tool; that way you'll have a better idea of what you're working with.
Secondly, a bookmark in Word does not have any value associated with it. It usually just marks a location in the Word document, so what you're trying to do won't work: the BookmarkStart element is empty, and its InnerText is always an empty string.
<w:bookmarkStart w:name="bkStart" w:id="0" />
That is the XML element that is created in the docx file when you add a bookmark to the document.
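When a bookmark covers a range of text, Word also writes a matching <w:bookmarkEnd> element with the same w:id, and the bookmarked text is whatever sits between the two markers. The following is a minimal sketch of collecting that text with LINQ to XML, not a definitive implementation. The helper name ReadBookmarkTexts and the sample markup are made up for illustration, and the code is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

// Hypothetical helper: maps each bookmark name to the text of all w:t elements
// found between its w:bookmarkStart and the w:bookmarkEnd with the same w:id.
Dictionary<string, string> ReadBookmarkTexts(XElement body)
{
    var result = new Dictionary<string, string>();
    foreach (XElement start in body.Descendants(w + "bookmarkStart"))
    {
        var id = start.Attribute(w + "id")?.Value;
        var name = start.Attribute(w + "name")?.Value;
        if (id == null || name == null) continue;

        // Walk the body in document order, starting just after the start marker
        // and stopping at the matching end marker (or at the end of the body
        // if no matching end marker exists).
        result[name] = string.Concat(
            body.Descendants()
                .SkipWhile(e => e != start)
                .Skip(1)
                .TakeWhile(e => !(e.Name == w + "bookmarkEnd"
                                  && e.Attribute(w + "id")?.Value == id))
                .Where(e => e.Name == w + "t")
                .Select(e => e.Value));
    }
    return result;
}

// Made-up sample markup standing in for a real document body.
XElement sampleBody = XElement.Parse(
    @"<w:body xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
        <w:p>
          <w:r><w:t xml:space=""preserve"">Athlete: </w:t></w:r>
          <w:bookmarkStart w:id=""0"" w:name=""AthleteName""/>
          <w:r><w:t xml:space=""preserve"">Jane </w:t></w:r>
          <w:r><w:t>Doe</w:t></w:r>
          <w:bookmarkEnd w:id=""0""/>
        </w:p>
      </w:body>");

var bookmarkTexts = ReadBookmarkTexts(sampleBody);
foreach (var pair in bookmarkTexts)
{
    Console.WriteLine(pair.Key + " - " + pair.Value);
}
```

With the Open XML SDK, one way to feed a real document into such a helper would be to parse wordDocument.MainDocumentPart.Document.Body.OuterXml into an XElement. The repeated scan from the top of the body is simple rather than fast, which is probably acceptable for small documents only.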

Solution 1:
Get the bookmark's text by accessing its parent's inner text. Note that this returns the text of the whole parent element (usually the paragraph that contains the bookmark), not only the bookmarked range:
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(athleteFile, false))
{
    foreach (BookmarkStart bookmark in wordDocument.MainDocumentPart.Document.Body.Descendants<BookmarkStart>())
    {
        // Get name of bookmark
        string bookmarkNameOriginal = bookmark.Name;
        // Get bookmark text from the parent element's text
        string bookmarkText = bookmark.Parent.InnerText;
    }
}
Solution 2:
I have found another solution using DocX by Xceed.
Note:
Reading bookmarks is slow in the free version (DocX v1.3). This is fixed in DocX v1.4, but the free version is slower to receive this update.
Import DocX:
using Xceed.Words.NET;
Create a method to read the bookmark names and text:
/// <summary>
/// Read bookmark text/names in a Word document
/// </summary>
/// <param name="filePath"></param>
/// <remarks>
/// Uses free DocX by Xceed
/// </remarks>
public void ReadBookmarks(string filePath)
{
    // Load document
    using (DocX Document = DocX.Load(filePath))
    {
        // This is slow in the free version (DocX v1.3); it is fixed in DocX v1.4 (the free version is slower to get this)
        BookmarkCollection bookmarks = Document.Bookmarks;
        // Iterate over bookmarks in document
        foreach (Bookmark bookmark in bookmarks)
        {
            // Name of bookmark
            string bookmarkName = bookmark.Name;
            // Text of the paragraph containing the bookmark, usually a Word heading (1, 2, 3...)
            string bookmarkText = bookmark.Paragraph.Text;
        }
    }
}

Related

How to convert Multiple HTML Pages to Single Doc in c#

I am converting a single HTML page to Doc using Spire.Doc. I need to convert multiple HTML pages from a single folder into a single Doc. How can this be done? Can anyone give me an idea, or is there a library available to achieve this?
Please find below my code to convert a single HTML file to Doc.
Spire.Doc.Document document = new Spire.Doc.Document();
document.LoadFromFile(@"D:\DocFilesConvert\htmlfile.html", Spire.Doc.FileFormat.Html, XHTMLValidationType.None);
document.SaveToFile(@"D:\DocFilesConvert\docfiless.docx", Spire.Doc.FileFormat.Docx);

There seems to be no direct way to achieve this. One workaround I found is to convert each HTML document to its own Word document, and then merge these Word documents into one file.
// get HTML file paths
string[] htmlfilePaths = new string[]{
    @"F:\Documents\Html\1.html",
    @"F:\Documents\Html\2.html",
    @"F:\Documents\Html\3.html"
};
// create Document array
Document[] docs = new Document[htmlfilePaths.Length];
for (int i = 0; i < htmlfilePaths.Length; i++)
{
    // load each HTML file into a separate Word document
    docs[i] = new Document(htmlfilePaths[i], FileFormat.Html);
    // append the sections of every later document to the first one
    if (i >= 1)
    {
        foreach (Section sec in docs[i].Sections)
        {
            docs[0].Sections.Add(sec.Clone());
        }
    }
}
// save to a Word document
docs[0].SaveToFile("output.docx", FileFormat.Docx2013);

How to access OpenXML content by page number?

Using OpenXML, can I read the document content by page number?
wordDocument.MainDocumentPart.Document.Body gives content of full document.
public void OpenWordprocessingDocumentReadonly()
{
    string filepath = @"C:\...\test.docx";
    // Open a WordprocessingDocument based on a filepath.
    using (WordprocessingDocument wordDocument =
        WordprocessingDocument.Open(filepath, false))
    {
        // Assign a reference to the existing document body.
        Body body = wordDocument.MainDocumentPart.Document.Body;
        int pageCount = 0;
        if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
        {
            pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
        }
        for (int i = 1; i <= pageCount; i++)
        {
            // Read the content by page number
        }
    }
}
MSDN Reference
Update 1:
It looks like page breaks are stored as below:
<w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
    <w:r>
        <w:br w:type="page" />
    </w:r>
</w:p>
So now I need to split the XML at this element and take the InnerText of each part; that will give me the text page by page.
Now the question becomes: how can I split the XML at this element?
Update 2:
Page break elements are present only where there are explicit page breaks. If text simply flows from one page to the next, no page break element is written, so it comes back to the same challenge of how to identify the page separations.

You cannot reference OOXML content via page numbering at the OOXML data level alone.
Hard page breaks are not the problem; hard page breaks can be counted.
Soft page breaks are the problem. These are calculated according to line break and pagination algorithms which are implementation dependent; they are not intrinsic to the OOXML data. There is nothing to count.
What about w:lastRenderedPageBreak, which is a record of the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak does not help in general either, because:
By definition, the w:lastRenderedPageBreak position is stale when content has been changed since the document was last opened by a program that paginates its content.
In MS Word's implementation, w:lastRenderedPageBreak is known to be unreliable in various circumstances, including:
    when a table spans two pages
    when the next page starts with an empty paragraph
    for multi-column layouts with text boxes starting a new column
    for large images or long sequences of blank lines
If you're willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numberings, page counts, etc.
Otherwise, the only real answer is to move beyond page-based referencing frameworks that are dependent upon proprietary, implementation-specific pagination algorithms.
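To make the distinction concrete, here is a small sketch that counts both kinds of marker with LINQ to XML. It is only an illustration under stated assumptions: the helper names and the sample markup are invented, the code is written as C# top-level statements (C# 9 or later), and only <w:br w:type="page"/> is treated as a hard break, so other ways of starting a new page (for example section breaks or the pageBreakBefore paragraph property) are not counted.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

// Hard page breaks are explicit in the markup, so they can be counted.
int CountHardPageBreaks(XElement body) =>
    body.Descendants(w + "br")
        .Count(br => br.Attribute(w + "type")?.Value == "page");

// These markers only record where soft breaks fell the last time the document
// was paginated; they may be missing or stale, so treat the count as a hint.
int CountLastRenderedPageBreaks(XElement body) =>
    body.Descendants(w + "lastRenderedPageBreak").Count();

// Made-up sample markup standing in for a real document body.
XElement sampleBody = XElement.Parse(
    @"<w:body xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
        <w:p><w:r><w:t>Text on the first page</w:t></w:r></w:p>
        <w:p><w:r><w:br w:type=""page""/></w:r></w:p>
        <w:p><w:r><w:lastRenderedPageBreak/><w:t>Text that flowed onto a later page</w:t></w:r></w:p>
        <w:p><w:r><w:br/><w:t>A plain line break, not a page break</w:t></w:r></w:p>
      </w:body>");

int hardBreaks = CountHardPageBreaks(sampleBody);
int softBreakHints = CountLastRenderedPageBreaks(sampleBody);
Console.WriteLine("Hard page breaks: " + hardBreaks);
Console.WriteLine("lastRenderedPageBreak markers: " + softBreakHints);
```

Neither number is a page count. The first ignores text that flows across pages, and the second depends on when, and by which application, the document was last rendered.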

This is how I ended up doing it.
public void OpenWordprocessingDocumentReadonly()
{
    string filepath = @"C:\...\test.docx";
    // Open a WordprocessingDocument based on a filepath.
    Dictionary<int, string> pageviseContent = new Dictionary<int, string>();
    int pageCount = 0;
    using (WordprocessingDocument wordDocument =
        WordprocessingDocument.Open(filepath, false))
    {
        // Assign a reference to the existing document body.
        Body body = wordDocument.MainDocumentPart.Document.Body;
        if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
        {
            pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
        }
        int i = 1;
        StringBuilder pageContentBuilder = new StringBuilder();
        foreach (var element in body.ChildElements)
        {
            if (element.InnerXml.IndexOf("<w:br w:type=\"page\" />", StringComparison.OrdinalIgnoreCase) < 0)
            {
                pageContentBuilder.Append(element.InnerText);
            }
            else
            {
                pageviseContent.Add(i, pageContentBuilder.ToString());
                i++;
                pageContentBuilder = new StringBuilder();
            }
            if (body.LastChild == element && pageContentBuilder.Length > 0)
            {
                pageviseContent.Add(i, pageContentBuilder.ToString());
            }
        }
    }
}
Downside: This won't work in all scenarios. It works only where there is an explicit page break; if text simply flows from page 1 to page 2, there is no marker to tell you that you are on page two.

Unfortunately, as the answers to "Why only some page numbers stored in XML of docx file?" explain, a docx file does not contain reliable page number information. The XML files carry no page numbers until Microsoft Word opens the document and renders it dynamically. Reading the Open XML documentation, for example https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1 , does not change this.
You can unzip some docx files and search for "page" or "pg" to see this for yourself. I did this on different kinds of docx files in my situation, and they all told me the same thing. Glad if this helps.

List<Paragraph> Allparagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();
List<Paragraph> PageParagraphs = Allparagraphs
    .Where(x => x.Descendants<LastRenderedPageBreak>().Count() == 1)
    .Distinct()
    .ToList();

Rename the .docx file to .zip.
Open the docProps\app.xml file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
<Template>Normal</Template>
<TotalTime>0</TotalTime>
<Pages>1</Pages>
<Words>141</Words>
<Characters>809</Characters>
<Application>Microsoft Office Word</Application>
<DocSecurity>0</DocSecurity>
<Lines>6</Lines>
<Paragraphs>1</Paragraphs>
<ScaleCrop>false</ScaleCrop>
<HeadingPairs>
<vt:vector size="2" baseType="variant">
<vt:variant>
<vt:lpstr>Название</vt:lpstr>
</vt:variant>
<vt:variant>
<vt:i4>1</vt:i4>
</vt:variant>
</vt:vector>
</HeadingPairs>
<TitlesOfParts>
<vt:vector size="1" baseType="lpstr">
<vt:lpstr/>
</vt:vector>
</TitlesOfParts>
<Company/>
<LinksUpToDate>false</LinksUpToDate>
<CharactersWithSpaces>949</CharactersWithSpaces>
<SharedDoc>false</SharedDoc>
<HyperlinksChanged>false</HyperlinksChanged>
<AppVersion>14.0000</AppVersion>
</Properties>
The Open XML library reads wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text from the <Pages>1</Pages> property. These properties are written only by the Word application (winword). If the Word document has been changed since then, wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text is not up to date. If the Word document was created programmatically, wordDocument.ExtendedFilePropertiesPart is often null.
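Because both the part and the <Pages> element can be absent, any code that reads this value should treat it as optional. Below is a minimal sketch using LINQ to XML on the contents of app.xml. The helper name ReadCachedPageCount and the trimmed sample XML are assumptions made for illustration, and the code is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Xml.Linq;

XNamespace ep = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

// Hypothetical helper: returns the cached page count from app.xml,
// or null when the document, the <Pages> element, or a valid number is missing.
int? ReadCachedPageCount(XDocument appXml)
{
    var raw = appXml?.Root?.Element(ep + "Pages")?.Value;
    return int.TryParse(raw, out int pages) ? pages : (int?)null;
}

// Trimmed-down sample of a docProps/app.xml written by Word.
XDocument withPages = XDocument.Parse(
    @"<Properties xmlns=""http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"">
        <Template>Normal</Template>
        <Pages>1</Pages>
        <Words>141</Words>
      </Properties>");

// Sample of what a programmatically generated file might contain instead.
XDocument withoutPages = XDocument.Parse(
    @"<Properties xmlns=""http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"">
        <Application>Some generator</Application>
      </Properties>");

int? cachedCount = ReadCachedPageCount(withPages);
int? missingCount = ReadCachedPageCount(withoutPages);
Console.WriteLine(cachedCount.HasValue ? "Cached pages: " + cachedCount : "No cached page count");
Console.WriteLine(missingCount.HasValue ? "Cached pages: " + missingCount : "No cached page count");
```

For a real file, the XML could be loaded from the docProps/app.xml entry of the package, for example through System.IO.Compression.ZipFile. Even when a number comes back, it is only the value stored by whichever application last saved the document.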

Word Interop Join documents in memory

I have a List<Word.Document> that I would like to join into a single Word.Document. Below is all that I have so far.
Any ideas?
public static Word.Document JoinDocuments(List<Word.Document> DocstoJoin)
{
    Word.Document JoinedDoc = new Word.Document();
    foreach (Word.Document doc in DocstoJoin)
    {
        foreach (Word.Section sec in doc.Sections)
        {
            **????**
        }
    }
    return JoinedDoc;
}

The Selection class provides the following methods that can be used to get the job done:
Copy - copies the specified selection to the Clipboard.
Paste - inserts the contents of the Clipboard at the specified selection.
Also you may consider using the Open XML SDK if you deal with open XML documents only.

Read bookmarks in outlook MSG file with C#

My goal is to be able to read the bookmarks in an Outlook .msg file and then replace them with different text. I want to do this with C#.
I know how to access the body and change the text, but I was wondering if there is a way to directly access the list of all the bookmarks and their locations, so that I can easily replace them instead of going through the whole body text, splitting it up, and so on.
Edit: bookmarks can be assigned from the bookmark window, so it should be possible to obtain this list via C#.
Any relevant info is appreciated.
Thanks in advance.

Since Outlook most often uses Word as its body editor, you need to add a project reference to Microsoft.Office.Interop.Word.dll and then access the Outlook Inspector's WordEditor during the Inspector.Activate event. Once you have access to the Word.Document, it's trivial to load the Bookmarks and access or modify their values.
Outlook.Inspector inspector = Globals.ThisAddIn.Application.ActiveInspector();
((Outlook.InspectorEvents_10_Event)inspector).Activate += () =>
{
    // validation to ensure we are using Word Editor
    if (inspector.EditorType == Outlook.OlEditorType.olEditorWord && inspector.IsWordMail())
    {
        Word.Document wordDoc = inspector.WordEditor as Word.Document;
        if (wordDoc != null)
        {
            var bookmarks = wordDoc.Bookmarks;
            foreach (Word.Bookmark item in bookmarks)
            {
                string name = item.Name; // bookmark name
                Word.Range bookmarkRange = item.Range; // bookmark range
                string bookmarkText = bookmarkRange.Text; // bookmark text
                item.Select(); // triggers bookmark selection
            }
        }
    }
};

Server-side word automation

I am looking for alternatives to using Open XML for a server-side Word automation project. Does anyone know of any other approaches that let me manipulate Word bookmarks and tables?

I am currently developing a Word automation project for my company, and I am using DocX. It has a very simple and straightforward API to work with. The approach I am using is this: whenever I need to work with the XML directly, the Paragraph class has a property named "Xml" which gives you access to the underlying XML so that I can work with it. The best part is that it does not break the XML or corrupt the resulting document. Hope this helps!
Example code using DocX:
// Requires: using System.Linq; using System.Xml.Linq;
XNamespace ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
using (DocX doc = DocX.Load(@"c:\temp\yourdoc.docx"))
{
    foreach (Paragraph para in doc.Paragraphs)
    {
        // WordprocessingML names are case-sensitive: the element is w:bookmarkStart
        // and the bookmark name is stored in its w:name attribute.
        XElement bookmark = para.Xml.Descendants(ns + "bookmarkStart")
            .FirstOrDefault(b => (string)b.Attribute(ns + "name") == "yourbookmarkname");
        if (bookmark != null)
        {
            // You got to your bookmark. To change the text, update the first w:t element
            // (text elements sit inside w:r runs, so search the descendants).
            XElement text = para.Xml.Descendants(ns + "t").FirstOrDefault();
            if (text != null)
            {
                text.SetValue("Text to replace..");
            }
        }
    }
}
An alternative API that works specifically with bookmarks is http://simpleooxml.codeplex.com/
Example of how to delete the text between bookmarkStart and bookmarkEnd using this API:
MemoryStream stream = DocumentReader.Copy(string.Format("{0}\\template.docx", TestContext.TestDeploymentDir));
WordprocessingDocument doc = WordprocessingDocument.Open(stream, true);
MainDocumentPart mainPart = doc.MainDocumentPart;
DocumentWriter writer = new DocumentWriter(mainPart);
//Simply Clears all text between bookmarkstart and end
writer.PasteText("", "YourBookMarkName");
//Save to the memory stream, and then to a file
writer.Save();
DocumentWriter.StreamToFile(string.Format("{0}\\templatetest.docx", GetOutputFolder()), stream);
Loading the Word document into the different APIs from a memory stream:
// Loading a document file into a MemoryStream using the SimpleOOXML API
MemoryStream stream = DocumentReader.Copy(@"c\template.docx");
// Opening it from the memory stream as an Open XML document
WordprocessingDocument doc = WordprocessingDocument.Open(stream, true);
// Opening it as a DocX document for working with the DocX API
DocX document = DocX.Load(stream);
