Ordering bookmarks by page using Microsoft Interop C#

I have a Word template file consisting of 2 pages, and each page has a bookmark: the bookmark on the first page is named A4 and the one on the second page is named A3. When I read all the bookmarks from the Word document I get them in alphabetical order, but I want them in page order. How can I do this?
foreach (Bookmark bookMark in MergeResultDoc.Bookmarks)
{//IMPORTANT: THE BOOKMARK NAME MUST BE THE PAPER TYPE
pagInizio = Convert.ToInt32(pagNum);
pagNum = bookMark.Range.Information[WdInformation.wdActiveEndPageNumber].ToString();
addData( pagInizio, pagNum, bookMark.Name);
iteration++;
}

You can read the bookMark.Start value.
This returns the start position of the Bookmark in the document.
So you can iterate over all the Bookmarks and sort them by their start position.
Here is some code to do that:
// List to store all bookmarks sorted by position.
List<Bookmark> bmList = new List<Bookmark>();
// Iterate over all the Bookmarks and add them to the list (unordered).
foreach (Bookmark curBookmark in MergeResultDoc.Bookmarks)
{
bmList.Add(curBookmark);
}
// Sort the List by the Start member of each Bookmark.
// After this line the bmList will be ordered.
bmList.Sort(delegate(Bookmark bm1, Bookmark bm2)
{
return bm1.Start.CompareTo(bm2.Start);
});

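Use LINQ OrderBy (the Bookmarks collection is a non-generic IEnumerable, so it needs Cast<Bookmark>() first):
var orderedResults = MergeResultDoc.Bookmarks.Cast<Bookmark>().OrderBy(d => d.Start).ToList();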

Document.Bookmarks should return the bookmarks in alphabetical order.
Document.Content.Bookmarks should return the bookmarks in the order in which they appear in the document. However, the VBA collection documentation does not typically guarantee a particular sequence for anything, so it is safer to read the Start value (as suggested by etaiso) and sort on that.
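The two suggestions above can be combined. The following is only a sketch: the class and method names are hypothetical, and it assumes a reference to Microsoft.Office.Interop.Word and a document that is already open. It reads the bookmarks from Document.Content, still sorts them by Start in case the collection order is not guaranteed, and pairs each bookmark name with the page number the question was reading.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Office.Interop.Word;

static class BookmarkPages
{
    // Hypothetical helper: (bookmark name, page number) pairs in document order.
    public static List<KeyValuePair<string, int>> InDocumentOrder(Document doc)
    {
        return doc.Content.Bookmarks
            .Cast<Bookmark>()              // Bookmarks is a non-generic IEnumerable
            .OrderBy(b => b.Start)         // explicit sort, not relying on collection order
            .Select(b => new KeyValuePair<string, int>(
                b.Name,
                Convert.ToInt32(b.Range.Information[WdInformation.wdActiveEndPageNumber])))
            .ToList();
    }
}
```

With the template from the question, this would be expected to yield A4 before A3, because A4 sits on the first page.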

Related

OpenXML Remove text from template

I have a number of .docx templates that customers download, but certain words need to be changed or removed from the document for different customers. I can't find anything on how to remove text:
using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
{
    foreach (Text element in doc.MainDocumentPart.Document.Body.Descendants<Text>())
    {
        //This is fine
        element.Text = element.Text.Replace("DocumentDate", wordReferenceTemplatesMV.DocumentDate);
        //Need help on how to remove text
        element.Text = element.Text.Remove???("TextToRemove")
    }
}

Why not just replace it with an empty string?
element.Text = element.Text.Replace("TextToRemove", string.Empty);

Most text values are in Run elements. Basically, you can go through all the Run elements and check their text. It should be something like:
Body body = wordprocessingDocument.MainDocumentPart.Document.Body;
foreach (Run r in body.Descendants<Run>())
{
    string sText = r.InnerText;
    //...compare the text with the value
    // Note: sometimes the text is broken into two runs; you need to find a way, based on your requirements, to connect them.
}
If you want to delete the text, you can just delete the run by calling its Remove() method:
r.Remove();
More details about Runs and text object,
If you use the file as a template, I usually set some special property on the Run element so that I can find it more accurately later.
For example, inside the run loop, before checking the text, you can check the highlight color first.
if( r.RunProperties.Highlight.Val == DocumentFormat.OpenXml.Wordprocessing.HighlightColorValues.Yellow )
{
    string sText = r.InnerText;
    // ...
}
Hope it helps.

If you don't want the element any more, you can delete the whole element. ToList() takes a snapshot first, so removing elements does not disturb the enumeration:
using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
{
    foreach (Text element in doc.MainDocumentPart.Document.Body.Descendants<Text>().ToList())
    {
        if (element.Text == "TextToRemove")
            element.Remove();
    }
}
Edit
If you're left with an empty line, chances are you have a Paragraph that contained the Text. In that case you want to remove the Paragraph instead. The direct parent of a Text is its Run, so look up the enclosing Paragraph:
if (element.Text == "TextToRemove")
    element.Ancestors<Paragraph>().FirstOrDefault()?.Remove();

I don't think it's the paragraph element causing the empty line when removed.
Clients send over a template with an address block like this:
[address1]
[address2]
[city]
[town]
[state]
[zip]
The fields are populated from the database with the replace function, but if an address doesn't contain an [address2] value, that placeholder is what I need to remove. If I remove the text, I'm still left with an empty line between [address1] and [city]. The [address2] field isn't in its own paragraph.

Add Comment in to selected Text in Word Document Using OpenXML c#

I need to use OpenXML to add comments to a Word document. I need to add a comment to a location or a word (or multiple words). Normally OpenXML returns the text of a Word document as run elements, but the words I want to comment on are spread across different run elements. So I can't attach the comment to the words I actually want; that is, I can't place the specific CommentRangeStart and CommentRangeEnd objects.
My current implementation is as below.
foreach (var paragraph in document.MainDocumentPart.Document.Descendants<DocumentFormat.OpenXml.Wordprocessing.Paragraph>())
{
foreach (var run in paragraph.Elements<Run>())
{
var item = run.Elements<Text>().FirstOrDefault(b => b.Text.Trim() == "My words selection to add comment");
if (item != null)
{
run.InsertBefore(new CommentRangeStart() { Id = id }, item);
var cmtEnd = run.InsertAfter(new CommentRangeEnd() { Id = id }, item);
run.InsertAfter(new Run(new CommentReference() { Id = id }), cmtEnd);
}
}
}
More detail:
<w:r><w:t>This </w:t></w:r>
<w:r><w:t>is </w:t></w:r>
<w:r><w:t>a first paragraph</w:t></w:r>
So how could I add a comment to the text "is a first para" in that case?
Or, in some cases, the OpenXML document contains a single run element, as below.
<w:r><w:t>This is a first paragraph</w:t></w:r>
So in both of these cases, how do I add a comment to my specific selection of words?

If the style doesn't differ, and if you are allowed to manipulate the doc, you could easily merge all runs in a paragraph, and then isolate the text run.
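As an alternative to merging, the runs can be split at the boundaries of the selection so that the comment markers sit between whole runs. The following is a rough sketch rather than a finished implementation. CommentAnchors, SplitRun and MarkRange are hypothetical names, and it assumes the Open XML SDK, a selection that lies within one paragraph, and runs that each hold exactly one Text child. The comment itself still has to be added to the comments part under the same id.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;

static class CommentAnchors
{
    // Moves the text after `offset` into a cloned run placed right after `run`.
    // Assumes the run has exactly one Text child.
    static Run SplitRun(Run run, int offset)
    {
        Text head = run.GetFirstChild<Text>();
        var tail = (Run)run.CloneNode(true); // the clone keeps the run properties
        Text tailText = tail.GetFirstChild<Text>();
        tailText.Text = head.Text.Substring(offset);
        head.Text = head.Text.Substring(0, offset);
        head.Space = SpaceProcessingModeValues.Preserve;
        tailText.Space = SpaceProcessingModeValues.Preserve;
        run.InsertAfterSelf(tail);
        return tail;
    }

    // Wraps the first occurrence of `target` in comment range markers with the given id.
    public static bool MarkRange(Paragraph paragraph, string target, string id)
    {
        var runs = paragraph.Elements<Run>().ToList(); // snapshot: the loop inserts runs
        string full = string.Concat(runs.Select(r => r.InnerText));
        int start = string.IsNullOrEmpty(target)
            ? -1
            : full.IndexOf(target, StringComparison.Ordinal);
        if (start < 0) return false;
        int end = start + target.Length;

        Run first = null, last = null;
        int pos = 0;
        foreach (Run run in runs)
        {
            int runStart = pos;
            int runEnd = pos + run.InnerText.Length;
            pos = runEnd;
            if (runEnd <= start || runStart >= end) continue; // outside the selection

            Run current = run;
            if (runStart < start) current = SplitRun(current, start - runStart);
            if (runEnd > end) SplitRun(current, end - Math.Max(runStart, start));
            if (first == null) first = current;
            last = current;
        }

        first.InsertBeforeSelf(new CommentRangeStart { Id = id });
        CommentRangeEnd rangeEnd = last.InsertAfterSelf(new CommentRangeEnd { Id = id });
        rangeEnd.InsertAfterSelf(new Run(new CommentReference { Id = id }));
        return true;
    }
}
```

Both layouts from the question go through the same path: with three runs only the last one is split, and with a single run it is split twice.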

How to Define a PDF Outline Using MigraDoc

I noticed when using MigraDoc that if I add a paragraph with any of the heading styles (e.g., "Heading1"), an entry is automatically placed in the document outline. My question is, how can I add entries in the document outline without showing the text in the document? Here is an example of my code:
var document = new Document();
var section = document.AddSection();
// The following line adds an entry to the document outline, but it also
// adds a line of text to the current section. How can I add an
// entry to the document outline without adding any text to the page?
var paragraph = section.AddParagraph("TOC Level 1", "Heading1");

I used a hack: added white text on white ground with a font size of 0.001 or so to get outlines that are actually invisible to the user.
For a perfect solution, mix PDFsharp and MigraDoc code. The hack works for me and is much easier to implement.

I realized after reading ThomasH's answer that I am already mixing PDFSharp and MigraDoc code. Since I am utilizing a PdfDocumentRenderer, I was able to add a custom outline to the PdfDocument property of that renderer. Here is an example of what I ended up doing to create a custom outline:
var document = new Document();
// Populate the MigraDoc document here
...
// Render the document
var renderer = new PdfDocumentRenderer(false, PdfFontEmbedding.Always)
{
Document = document
};
renderer.RenderDocument();
// Create the custom outline
var pdfSharpDoc = renderer.PdfDocument;
var rootEntry = pdfSharpDoc.Outlines.Add(
"Level 1 Header", pdfSharpDoc.Pages[0]);
rootEntry.Outlines.Add("Level 2 Header", pdfSharpDoc.Pages[1]);
// Etc.
// Save the document
pdfSharpDoc.Save(outputStream);

I've got a method that is slightly less hacked. Here's the basic method:
1) Add a bookmark, save into a list that bookmark field object and the name of the outline entry. Do not set a paragraph .OutlineLevel (or set as bodytext)
// Defined previously
List<dynamic> Bookmarks = new List<dynamic>();
// In your bookmarking method, P is a Paragraph already created somewhere
Bookmarks.Add(new { Bookmark = P.AddBookmark("C1"), Name = "Chapter 1", Depth = 0 });
2) At the end of your MigraDoc layout, before rendering, prepare the pages
pdfwriter.PrepareRenderPages();
3) Build a dictionary of the Bookmark's parent's parent (This will be a paragraph) and pages (pages will be initialized to -1)
var Pages = Bookmarks.Select(x => ((BookmarkField)x.Bookmark).Parent.Parent).ToDictionary(x => x, x => -1);
4) Now fill in those pages by iterating through the objects on each page, finding the match
for (int i = 0; i < pdfwriter.PageCount; i++)
    foreach (var s in pdfwriter.DocumentRenderer.GetDocumentObjectsFromPage(i).Where(x => Pages.ContainsKey(x)))
        Pages[s] = i-1;
5) You've now got a dictionary of Bookmark's parent's parents to page numbers, with this you can add your outlines directly into the PDFSharp document. This also iterates down the depth-tree, so you can have nested outlines
foreach(dynamic d in Bookmarks)
{
var o = pdfwriter.PdfDocument.Outlines;
for(int i=0;i<d.Depth;i++)
o = o.Last().Outlines;
BookmarkField BK = d.Bookmark;
int PageNumber = Pages[BK.Parent.Parent];
o.Add(d.Name, pdfwriter.PdfDocument.Pages[PageNumber], true, PdfOutlineStyle.Regular);
}

C# openxml removal of paragraph

I am trying to remove a paragraph from a .docx file using OpenXML (I'm using placeholder text to generate documents from a template-like docx file), but whenever I remove a paragraph it breaks the foreach loop I'm using to iterate through the elements.
MainDocumentPart mainPart = doc.MainDocumentPart;
IEnumerable<OpenXmlElement> elems = mainPart.Document.Body.Descendants();
foreach(OpenXmlElement elem in elems){
if(elem is Text && elem.InnerText == "##MY_PLACE_HOLDER##")
{
Run run = (Run)elem.Parent;
Paragraph p = (Paragraph)run.Parent;
p.RemoveAllChildren();
p.Remove();
}
}
This works: it removes my placeholder and the paragraph it is in, but the foreach loop stops iterating, and I need to do more things in that loop.
Is this an OK way to remove a paragraph in C# using OpenXML? Why is my foreach loop stopping, and how can I make it not stop? Thanks.

This is the "Halloween Problem", so called because it was noticed by some developers on Halloween, and it looked spooky to them. It is the problem of using declarative code (queries) with imperative code (deleting nodes) at the same time. If you think about it, you are iterating through a linked list, and if you start deleting nodes in the linked list, you totally mess up the iterator. A simple way to avoid this problem is to "materialize" the results of the query in a List, and then you can iterate through the list and delete nodes at will. The only difference in the following code is that it calls ToList after calling the Descendants axis.
MainDocumentPart mainPart = doc.MainDocumentPart;
IEnumerable<OpenXmlElement> elems = mainPart.Document.Body.Descendants().ToList();
foreach(OpenXmlElement elem in elems){
if(elem is Text && elem.InnerText == "##MY_PLACE_HOLDER##")
{
Run run = (Run)elem.Parent;
Paragraph p = (Paragraph)run.Parent;
p.RemoveAllChildren();
p.Remove();
}
}
However, I have to note that I see another bug in your code. There is nothing to stop Word from splitting up that text node into multiple text elements from multiple runs. While in most cases, your code will work fine, sooner or later, you or a user is going to take some action (like selecting a character, and accidentally hitting the bold button on the ribbon) and then your code will no longer work.
If you really want to work at the text level, then you need to use code such as what I introduce in this screen-cast: http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2011/08/04/introducing-textreplacer-a-new-class-for-powertools-for-open-xml.aspx
In fact, you could probably use that code verbatim to handle your use case, I believe.
Another approach, more flexible and powerful, is detailed in:
http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2011/06/13/open-xml-presentation-generation-using-a-template-presentation.aspx
While that screen-cast is about PresentationML, the same principles apply to WordprocessingML.
But even better, given that you are using WordprocessingML, is to use content controls. For one approach to document generation, see:
http://ericwhite.com/blog/map/generating-open-xml-wordprocessingml-documents-blog-post-series/
And for lots of information about using content controls in general, see:
http://www.ericwhite.com/blog/content-controls-expanded
-Eric
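The split-run concern raised above can be reduced, for the simple case of deleting a whole paragraph, by matching on the paragraph's combined text instead of on a single Text element. This is a minimal sketch under assumptions, not the TextReplacer code linked above: PlaceholderCleanup and RemoveParagraphsContaining are hypothetical names, and it only assumes the Open XML SDK.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;

static class PlaceholderCleanup
{
    // Removes every paragraph whose combined run text contains the placeholder,
    // even when Word has split the placeholder across several runs.
    public static int RemoveParagraphsContaining(Body body, string placeholder)
    {
        // ToList() materializes the matches before anything is removed.
        var matches = body.Descendants<Paragraph>()
            .Where(p => p.InnerText.Contains(placeholder))
            .ToList();

        foreach (Paragraph p in matches)
            p.Remove();

        return matches.Count;
    }
}
```

Called as PlaceholderCleanup.RemoveParagraphsContaining(mainPart.Document.Body, "##MY_PLACE_HOLDER##"), it returns how many paragraphs were removed, which makes it easy to log when a template no longer contains the placeholder.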

You have to use two loops: the first stores the items you want to delete, and the second deletes them.
Something like this:
List<Paragraph> paragraphsToDelete = new List<Paragraph>();
foreach(OpenXmlElement elem in elems){
if(elem is Text && elem.InnerText == "##MY_PLACE_HOLDER##")
{
Run run = (Run)elem.Parent;
Paragraph p = (Paragraph)run.Parent;
paragraphsToDelete.Add(p);
}
}
foreach (var p in paragraphsToDelete)
{
p.RemoveAllChildren();
p.Remove();
}
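The idea behind collecting first and deleting afterwards can be seen without OpenXML at all. The sketch below uses a plain List<int> as a stand-in, and SnapshotDemo is a hypothetical name. One difference worth noting: List<T> throws when it is modified during enumeration, whereas the Descendants() enumeration in the question simply stopped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SnapshotDemo
{
    // Removes items while enumerating the live list; List<T> detects the change.
    public static bool LiveRemovalThrows()
    {
        var numbers = new List<int> { 1, 2, 3, 4 };
        try
        {
            foreach (int n in numbers)
                if (n % 2 == 0) numbers.Remove(n);
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    // Enumerates a copy made by ToList(), so the original can be modified safely.
    public static List<int> SnapshotRemoval()
    {
        var numbers = new List<int> { 1, 2, 3, 4 };
        foreach (int n in numbers.ToList())
            if (n % 2 == 0) numbers.Remove(n);
        return numbers;
    }
}
```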

Dim elems As IEnumerable(Of OpenXmlElement) = MainPart.Document.Body.Descendants().ToList()
For Each elem As OpenXmlElement In elems
    ' IndexOf returns 0 for a match at the very start, so test for >= 0.
    If elem.InnerText.IndexOf("fullname") >= 0 Then
        elem.RemoveAllChildren()
    End If
Next

Lucene.Net - Get distinct categories

I have created the following document:
var document = new Document();
document.Add(new Field("category", "foo", Field.Store.YES, Field.Index.NOT_ANALYZED));
...
I have approx 10M documents which belong to 8 distinct categories. I would like to get all distinct categories (get all documents and read a value of category field) by executing search query. Is that feasible?
Another approach is to create a list of categories at index rebuild and to write these values in database.
Any help would be greatly appreciated!

Check out the IndexReader.Terms() method.
If you give it an empty Term for a field, it will return a TermEnum containing all the terms for that field.
TermEnum terms = indexReader.Terms(new Term("category"));
// enumerate the terms

To extend Beaulac's solution for future use...
To get only the unique values, you must iterate through the terms like this:
while (null != terms.Term) {
    if (terms.Term.Field.Equals("category")) {
        // do something with this term
    }
    terms.Next();
}
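Put together, the two answers above could look like the following sketch. It assumes Lucene.Net 3.0.x, where Term and TermEnum expose Term, Field and Text as properties. CategoryTerms and DistinctValues are hypothetical names. Because the enumeration is ordered by field and then by text, the loop can stop as soon as it leaves the requested field instead of walking every term in the index.

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;

static class CategoryTerms
{
    // Collects the distinct indexed values of one field from the term dictionary.
    public static List<string> DistinctValues(IndexReader reader, string field)
    {
        var values = new List<string>();
        using (TermEnum terms = reader.Terms(new Term(field)))
        {
            do
            {
                Term term = terms.Term;
                // Stop at the end of the enumeration or at the first term of another field.
                if (term == null || term.Field != field) break;
                values.Add(term.Text);
            } while (terms.Next());
        }
        return values;
    }
}
```

This reads only the term dictionary, so it should stay cheap even with 10M documents. Terms that belong only to deleted documents can still show up until the affected segments are merged.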
