How to access OpenXML content by page number?

How to access OpenXML content by page number? - c#

Using OpenXML, can I read the document content by page number?
wordDocument.MainDocumentPart.Document.Body gives content of full document.
public void OpenWordprocessingDocumentReadonly()
{
string filepath = #"C:\...\test.docx";
// Open a WordprocessingDocument based on a filepath.
using (WordprocessingDocument wordDocument =
WordprocessingDocument.Open(filepath, false))
{
// Assign a reference to the existing document body.
Body body = wordDocument.MainDocumentPart.Document.Body;
int pageCount = 0;
if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
{
pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
}
for (int i = 1; i <= pageCount; i++)
{
//Read the content by page number
}
}
}
MSDN Reference
Update 1:
it looks like page breaks are set as below
<w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
<w:r>
<w:br w:type="page" />
</w:r>
</w:p>
So now I need to split the XML with above check and take InnerTex for each, that will give me page vise text.
Now question becomes how can I split the XML with above check?
Update 2:
Page breaks are set only when you have page breaks, but if text is floating from one page to other pages, then there is no page break XML element is set, so it revert back to same challenge how o identify the page separations.

You cannot reference OOXML content via page numbering at the OOXML data level alone.
Hard page breaks are not the problem; hard page breaks can be counted.
Soft page breaks are the problem. These are calculated according to
line break and pagination algorithms which are implementation
dependent; it is not intrinsic to the OOXML data. There is nothing
to count.
What about w:lastRenderedPageBreak, which is a record of the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak does not help in general either because:
By definition, w:lastRenderedPageBreak position is stale when content has
been changed since last opened by a program that paginates its
content.
In MS Word's implementation, w:lastRenderedPageBreak is known to be unreliable in various circumstances including
when table spans two pages
when next page starts with an empty paragraph
for
multi-column layouts with text boxes starting a new column
for
large images or long sequences of blank lines
If you're willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numberings, page counts, etc.
Otherwise, the only real answer is to move beyond page-based referencing frameworks that are dependent upon proprietary, implementation-specific pagination algorithms.

This is how I ended up doing it.
public void OpenWordprocessingDocumentReadonly()
{
string filepath = #"C:\...\test.docx";
// Open a WordprocessingDocument based on a filepath.
Dictionary<int, string> pageviseContent = new Dictionary<int, string>();
int pageCount = 0;
using (WordprocessingDocument wordDocument =
WordprocessingDocument.Open(filepath, false))
{
// Assign a reference to the existing document body.
Body body = wordDocument.MainDocumentPart.Document.Body;
if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
{
pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
}
int i = 1;
StringBuilder pageContentBuilder = new StringBuilder();
foreach (var element in body.ChildElements)
{
if (element.InnerXml.IndexOf("<w:br w:type=\"page\" />", StringComparison.OrdinalIgnoreCase) < 0)
{
pageContentBuilder.Append(element.InnerText);
}
else
{
pageviseContent.Add(i, pageContentBuilder.ToString());
i++;
pageContentBuilder = new StringBuilder();
}
if (body.LastChild == element && pageContentBuilder.Length > 0)
{
pageviseContent.Add(i, pageContentBuilder.ToString());
}
}
}
}
Downside: This wont work in all scenarios. This will work only when you have a page break, but if you have text extended from page 1 to page 2, there is no identifier to know you are in page two.

Unfortunately, As Why only some page numbers stored in XML of docx file? answers, docx dose not contains reliable page number service. Xml files carry no page number, until microsoft Word open it and render dynamically. Even you read openxml documents like https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1 .
You can unzip some docx files, and search "page" or "pg". Then you will know it. I do this on different kinds of docx files in my situation. All tell me the same truth. Glad if this helps.

List<Paragraph> Allparagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();
List<Paragraph> PageParagraphs = Allparagraphs.Where (x=>x.Descendants<LastRenderedPageBreak>().Count() ==1) .Select(x => x).Distinct().ToList();

Rename docx to zip.
Open docProps\app.xml file. :
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
<Template>Normal</Template>
<TotalTime>0</TotalTime>
<Pages>1</Pages>
<Words>141</Words>
<Characters>809</Characters>
<Application>Microsoft Office Word</Application>
<DocSecurity>0</DocSecurity>
<Lines>6</Lines>
<Paragraphs>1</Paragraphs>
<ScaleCrop>false</ScaleCrop>
<HeadingPairs>
<vt:vector size="2" baseType="variant">
<vt:variant>
<vt:lpstr>Название</vt:lpstr>
</vt:variant>
<vt:variant>
<vt:i4>1</vt:i4>
</vt:variant>
</vt:vector>
</HeadingPairs>
<TitlesOfParts>
<vt:vector size="1" baseType="lpstr">
<vt:lpstr/>
</vt:vector>
</TitlesOfParts>
<Company/>
<LinksUpToDate>false</LinksUpToDate>
<CharactersWithSpaces>949</CharactersWithSpaces>
<SharedDoc>false</SharedDoc>
<HyperlinksChanged>false</HyperlinksChanged>
<AppVersion>14.0000</AppVersion>
</Properties>
OpenXML lib reads wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text from <Pages>1</Pages> property . This properies are created only by winword application. if word document changed wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text is not actual. if word document created programmatically the wordDocument.ExtendedFilePropertiesPart is offten null.

Related

OpenXML SDK: How to get "Tag" (p:tag) val of a PowerPoint shape?

I need to check all tags on all shapes on all slides. I can select each shape, however I can't see how to get the shape's tags.
For the given DocumentFormat.OpenXml.Presentation.Shape, how can I get the "val" of the tag with name="MOUNTAIN"
In my shape, the tag rId is in this structure: p:sp > p:nvSpPr > p:cNvPr > p:nvPr > p:custDataList > p:tags
I'm guessing my code needs to do these steps:
• Get the rId of the p:custDataLst p:tags
• Look up the "Target" file name in the slideX.xml.rels file, based on the rId
• Look in the root/tags folder for the "Target" file
• Get the p:tagLst p:tags and look for the p:tag with name="MOUNTAIN"
<p:tagLst
<p:tag name="MOUNTAIN" val="Denali"/>
</p:tagLst>
Here is how my code iterates through shapes on each slide:
for (int x = 0; x < doc.PresentationPart.SlideParts.Count(); x++)
{
SlidePart slide = doc.PresentationPart.SlideParts.ElementAt(x);
ShapeTree tree = slide.Slide.CommonSlideData.ShapeTree;
IEnumerable<DocumentFormat.OpenXml.Presentation.Shape> slShapes = slide.Slide.Descendants<DocumentFormat.OpenXml.Presentation.Shape>();
foreach (DocumentFormat.OpenXml.Presentation.Shape shape in slShapes)
{
//get the specified tag, if it exists
}
}
I see an example of how to add tags: How to add custom tags to powerpoint slides using OpenXml in c#
But I can't figure out how to read the existing tags.
So, how do I get the shape's tags with c#?
I was hoping to do something like this:
IEnumerable<UserDefinedTagsPart> userDefinedTagsParts = shape.NonVisualShapeProperties.ApplicationNonVisualDrawingProperties.CustomerDataList.CustomerDataTags<UserDefinedTagsPart>();
foreach (UserDefinedTagsPart userDefinedTagsPart in userDefinedTagsParts)
{}
but Visual Studio says "ApplicationNonVisualDrawingProperties does not contain a definition for CustomerDataList".
From the OpenXML Productivity Tool, here is the element tree:

You and I seem to be working on similar problems. I'm struggling with learning the file format. The following code is working for me, I'm sure it can be optimized.
public void ReadTags(Shape shape, SlidePart slidePart)
{
NonVisualShapeProperties nvsp = shape.NonVisualShapeProperties;
ApplicationNonVisualDrawingProperties nvdp = nvsp.ApplicationNonVisualDrawingProperties;
IEnumerable<CustomerDataTags> data_tags = nvdp.Descendants<CustomerDataTags>();
foreach (var data_tag in data_tags)
{
UserDefinedTagsPart shape_tags = slidePart.GetPartById(data_tag.Id) as UserDefinedTagsPart;
if (shape_tags != null)
{
foreach (Tag tag in shape_tags.TagList)
{
Debug.Print($"\t{nvsp.NonVisualDrawingProperties.Name} tag {tag.Name} = '{tag.Val}");
}
}
}
}

I've spent a lot of time with OpenXML .docx and .xlsx files ... but not so much with .pptx.
Nevertheless, here are a couple of suggestions that might help:
If you haven't already done so, please downoad the OpenXML SDK Productivity Tool to analyze your file's contents. It's currently available on GitHub:
https://github.com/dotnet/Open-XML-SDK/releases/tag/v2.5
You might simply be able to "grep" for items you're looking for.
EXAMPLE (Word, not PowerPoint... but the same principle should apply):
using (doc = WordprocessingDocument.Open(stream, true))
{
// Init OpenXML members
mainPart = doc.MainDocumentPart;
body = mainPart.Document.Body;
...
foreach (var text in body.Descendants<Text>())
{
if (text.Text.Contains(target))
...

How to dynamically link bookmarks to table of contents using PDFsharp + MigraDoc

I'm trying to create a Table of Contents using MigraDoc and PDFsharp and I've gotten really close but the problem I'm currently having is that the links on the Table of Contents all take me to the very first page of the PDF. I'm trying to link them to their respective pages. PDFSharp bookmarks work fine but when trying to create a table of contents based on the merged PDF it's not working.
static void TableOfContents(PdfDocument document)
{
// Puts the Table of contents on the second page
PdfPage page = document.Pages[1];
XGraphics gfx = XGraphics.FromPdfPage(page);
gfx.MUH = PdfFontEncoding.Unicode;
// Create MigraDoc document + Setup styles
Document doc = new Document();
Styles.DefineStyles(doc);
// Add header
Section section = doc.AddSection();
Paragraph paragraph = section.AddParagraph("Table of Contents");
paragraph.Format.Font.Size = 14;
paragraph.Format.Font.Bold = true;
paragraph.Format.SpaceAfter = 24;
paragraph.Format.OutlineLevel = OutlineLevel.Level1;
// Add links - these are the PdfSharp outlines/bookmarks
// added previously when concatinating the pages
foreach (var bookmark in document.Outlines)
{
paragraph = section.AddParagraph();
paragraph.Style = "TOC";
paragraph.AddBookmark(bookmark.Title);
Hyperlink hyperlink = paragraph.AddHyperlink(bookmark.Title);
hyperlink.AddText($"{bookmark.Title}\t");
hyperlink.AddPageRefField(bookmark.Title);
}
// Render document
DocumentRenderer docRenderer = new DocumentRenderer(doc);
docRenderer.PrepareDocument();
docRenderer.RenderPage(gfx, 1);
gfx.Dispose();
}
Ideally I want it to return the file's name (which it's doing) and the page number (it's only returning the first page). This is what it's currently outputting.
Table of Contents
file name here......................... 1
file name here......................... 1
file name here......................... 1
file name here......................... 1

As I understand it, the Hyperlink and bookmark should be unique to the document.
Otherwise the link will be made to the first paragraph containing the bookmark.
I simply use a number which I increase for a simple report I make.
private void DefineTOCLine(int level, string text, Paragraph linkTo)
{
var tocIndex = (tocindex++).ToString(CultureInfo.InvariantCulture);
var paragraph = tocsection.AddParagraph();
paragraph.Style = level == 1 ? "TOC1" : "TOC2";
var hyperlink = paragraph.AddHyperlink(tocIndex);
hyperlink.AddText(text + "\t");
hyperlink.AddPageRefField(tocIndex);
linkTo.AddBookmark(tocIndex);
}

You invoke hyperlink.AddPageRefField to set a reference, but as far as I can tell you never create the MigraDoc bookmark for the target of the reference by calling MigraDoc's AddBookmark method.
MigraDoc bookmarks are different from PDF file bookmarks.

How to convert Multiple HTML Pages to Single Doc in c#

I am converting a single HTML page to Doc using spire doc. I need to convert multiple html pages from single folder to single Doc. How this can be done. Can anyone give some idea or any library available to achieve this?
Please find my code to convert single HTML to Doc.
Spire.Doc.Document document = new Spire.Doc.Document();
document.LoadFromFile(#"D:\DocFilesConvert\htmlfile.html", Spire.Doc.FileFormat.Html, XHTMLValidationType.None);
document.SaveToFile(#"D:\DocFilesConvert\docfiless.docx", Spire.Doc.FileFormat.Docx);

There seems no direct way to achieve this. One workaround I find is to convert each HTML document to a single Word file, and then merge these Word files in one file.
//get HTML file paths
string[] htmlfilePaths = new string[]{
#"F:\Documents\Html\1.html",
#"F:\Documents\Html\2.html",
#"F:\Documents\Html\3.html"
};
//create Document array
Document[] docs = new Document[htmlfilePaths.Length];
for (int i = 0; i < htmlfilePaths.Length; i++)
{
//load each HTML to a sperate Word file
docs[i] = new Document(htmlfilePaths[i], FileFormat.Html);
//combine these Word files in one file
if (i>=1)
{
foreach (Section sec in docs[i].Sections)
{
docs[0].Sections.Add(sec.Clone());
}
}
}
//save to a Word document
docs[0].SaveToFile("output.docx", FileFormat.Docx2013);

Edit Total Number of Pages in Footer SelectPDF

I am converting html page to pdf using HtmlToPdf() of SelectPDF. Since html content is big, I am breaking it in half and creating 2 PDFs.
I am struggling to edit the total_pages in the footer to display actual total number of the pages, not only the current document; as well as page_number to display the actual page number in the context of both PDFs.
How can I assess {page_number} and {total_pages} to calculate proper values? All examples I found use PdfDocument(), not HtmlToPdf().
Dim converter As New HtmlToPdf()
Dim text As New PdfTextSection(0, 10, "Page: {page_number} of {total_pages} ")
text.HorizontalAlign = PdfTextHorizontalAlign.Center
converter.Footer.Add(text)
I am tagging both C# and VB since SelectPDF is for both languages, and relevant sample from either one will work for me. Thank you

Today I've stumbled upon the same issue and I have found a work-around for the problem. The converter was able to show page numbers for it's the generated document but can't be aware of multiple generated files (you can't access the page properties) so all my pages I concatenated were showing Page 1 of 1.
First I define one PdfDocument (see it as the main document) and I use HtmlToPdf to append html converted files to this main document.
// Create converter
converter = new HtmlToPdf();
PdfTextSection text = new PdfTextSection(0, 10, "Page: {page_number} of {total_pages} ", new Font("Arial", 8));
text.HorizontalAlign = PdfTextHorizontalAlign.Right;
converter.Footer.Add(text);
// Create main document
pdfDocument = new PdfDocument();
Then I add pages (from html) using this method
public void AddPage(string htmlPage)
{
PdfDocument doc = converter.ConvertHtmlString(htmlPage);
pdfDocument.Append(doc);
converter.Footer.TotalPagesOffset += doc.Pages.Count;
converter.Footer.FirstPageNumber += doc.Pages.Count;
}
This results in correct page numbers for the main document. The same trick could be used for splitting files and page numbers over multiple documents like you described.
EDIT: In case you don't see any page numbering using the HtmlToPdf converter, don't forget to set following property:
converter.Options.DisplayFooter = true;

There is an open source library called itextsharp that will help get total page count.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.xml;
namespace GetPages_PDF
{
class Program
{
static void Main(string[] args)
{
// Right side of equation is location of YOUR pdf file
string ppath = "C:\\aworking\\Hawkins.pdf";
PdfReader pdfReader = new PdfReader(ppath);
int numberOfPages = pdfReader.NumberOfPages;
Console.WriteLine(numberOfPages);
Console.ReadLine();
}
}
}
Then you can stamp text also on the page but you will need to add the location to where it needs to go.
link: http://crmhunt.com/how-to-modify-pdf-file-using-itextsharp/
hope this helps in some way.

You should use the following properties:
FirstPageNumber - Controls the page number for the first page being
rendered.
TotalPagesOffset - Controls the total number of pages
offset in the generated pdf document.
More details here:
http://selectpdf.com/html-to-pdf/docs/html/HtmlToPdfHeadersAndFooters.htm

The answers above did not work for me as I was trying to merge multiple PDFs with different orientations. bonnoj's answer did add page numbers but they were incorrect and I couldn't find a way to correct them. So I took a different approach - I created a PDF, then for each HTML page I added a pdfPage and then added a PdfHtmlElement to that page. Finally I loop over the pages and add a custom footer to each page. This may not be the most efficient way to do this but it's the only way that I could find that added the footer in the correct place when mixing portrait and landscape pages. Hopefully it will save somebody else spending hours playing with different properties.
var pdfDocument = new PdfDocument(PdfStandard.Full);
foreach (var (html, pdfPageOrientation) in pages)
{
var page = pdfDocument.AddPage(PdfCustomPageSize.A4, new PdfMargins(marginLeft, marginRight, marginTop, marginBottom));
page.Orientation = pdfPageOrientation;
var pdfHtmlElement = new PdfHtmlElement(html, "");
page.Add(pdfHtmlElement);
}
var pdfFont = pdfDocument.AddFont(PdfStandardFont.Helvetica);
pdfFont.Size = 12;
foreach (PdfPage page in pdfDocument.Pages)
{
var customFooter = pdfDocument.AddTemplate(page.PageSize.Width, 30);
var pdfFooterTextElement = new PdfTextElement(0, 15,
pageFooterText,
pdfFont)
{
HorizontalAlign = PdfTextHorizontalAlign.Right,
VerticalAlign = PdfTextVerticalAlign.Bottom,
};
customFooter.Add(pdfFooterTextElement);
page.CustomFooter = customFooter;
}
pdfDocument.Save(stream);

iTextSharp can't read some PDF files

I have a problem to read and display content of some PDFs into RichTextBox.
I use the following code:
string fileName = #"C:\Users\PC\Desktop\SomePdf.pdf";
string str = string.Empty;
PdfReader reader = new PdfReader(fileName);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
String s = PdfTextExtractor.GetTextFromPage(reader, i, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
str = str + s;
rtbVsebina.Text = str;
}
reader.Close();
Some PDFs can be read and displayed into RichTextBox and some they can not be. For those that can not be read I only get empty RichTextBox but with some added lines as I would press Key 'Enter' on the keyboard a couple of times.
Does anybody know what could be wrong?

You are confusing page content with page annotations.
Page content is part of the content stream of a page. It's referred to in the /Contents entry of the page dictionary and (optionally) in external objects (aka XObjects). With the code snippet you have copy/pasted in your question, you are extracting this content.
A rich text box is one of the many types of annotations. Annotations are not part of the content stream of a page. They are referred to from the /Annots entry of the page dictionary. If you want to get the contents of an annotation, you need to ask the page for its annotations instead of parsing the content of the page. See for instance Reading PDF Annotations with iText.
In answer to your question "What am I doing wrong": you were looking at the wrong place.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to access OpenXML content by page number? - c#

List<Paragraph> Allparagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList(); List<Paragraph> PageParagraphs = Allparagraphs.Where (x=>x.Descendants<LastRenderedPageBreak>().Count() ==1) .Select(x => x).Distinct().ToList();

Related

OpenXML SDK: How to get "Tag" (p:tag) val of a PowerPoint shape?

How to dynamically link bookmarks to table of contents using PDFsharp + MigraDoc

How to convert Multiple HTML Pages to Single Doc in c#

Edit Total Number of Pages in Footer SelectPDF

iTextSharp can't read some PDF files

Categories

Resources