Split Doc file Pages And Convert To PDF with gembox Document

Split Doc file Pages And Convert To PDF with gembox Document - c#

I want to convert the entire content of that page to PDF by searching for a specific word on each page (which may be on one page or more).
For example, we have a file that has three pages, there is a special word on the first page, and the next special word on the third page. I want to save the PDF from the first to the second page and then save the third page separately. The PDF files will be named according to the specific word on that page.
My problem is that I don't know how to loop for each page and read the content of that page to get to the special word and save the pages as a PDF.
Thank You

Here is how you can do it.
Paginate your Word document using DocumentModel.GetPaginator method.
Read the text content of each page using FrameworkElement.ToText extension method.
Save selected pages to PDF using DocumentModelPage.Save method.
In other words, try the following:
string search = "Your Specific Word";
string inputPath = "input.docx";
// Load Word document.
var document = DocumentModel.Load(inputPath);
// 1. Get document's pages.
var pages = document.GetPaginator().Pages;
for (int i = 0, count = pages.Count; i < count; ++i)
{
// 2. Read page's text content.
DocumentModelPage page = pages[i];
string pageTextContent = page.PageContent.ToText();
// 3. Save page as PDF.
if (pageTextContent.Contains(search))
{
string outputPath = $"{search}_{i}.pdf";
page.Save(outputPath);
}
}

Related

How to Identify the Page overflow in Aspose PDF

I am converting a HTML content to PDF using Aspose .Net API's. HTML content contains table which has dynamic data, so if the data is more date, then the table is overflown into second page.
Is there a way to Identify the page overflown in my document, because I want insert an empty page in between, whenever an overflow happens.
Code Snippet:
// Creating Doc
Aspose.Pdf.Document document = new Document();
for(var contentModel in listcontentModel)
{
Page page = document.Pages.Add();
string HTMLContent = //HTMLContent -- using contentModel to build HTML;
-- Sometime page overflow happens because of huge HTML content
HtmlFragment printHtml = new HtmlFragment(text);
page.Paragraphs.Add(printHtml);
}
document.Save(filePath);
Thanks in Advance.

Calculate number of pages in docs file by dot net and C# and Syncfusion.DocIO.Net.Core Package

I have a .net Core 3.1 webapi.
I want to get the number of pages are in docs and doc file.
Am using the Syncfusion.DocIO.Net.Core package to perform operation on docs/doc file. But it does't provide feature to update stat of file and display only the PageCount: document.BuiltinDocumentProperties.PageCount;
This is not updated by files. Would you someone suggest me how i can calculate arithmetically.

You can get the page count from BuiltInProperties of the document using DocIO as below. It shows the page count in the document while creating the document using Microsoft Word application and also it still returns the same page count if we manipulate the document using DoIO.
int count = document.BuiltinDocumentProperties.PageCount;
If you want to get the page count, after manipulating the Word document using DocIO, we suggest you to convert the word document to PDF, and then you can retrieve the page count as like below.
WordDocument wordDocument = new WordDocument(fileStream, FormatType.Docx);
DocIORenderer render = new DocIORenderer();
//Sets Chart rendering Options.
render.Settings.ChartRenderingOptions.ImageFormat = ExportImageFormat.Jpeg;
//Converts Word document into PDF document
PdfDocument pdfDocument = render.ConvertToPDF(wordDocument);
int pageCount = pdfDocument.PageCount;
Since Word document is a flow document in which contents will not be preserved page by page; instead the contents will be preserved sequentially section by section. Each section may extend to various pages based on its contents like table, text, images etc.
Whereas Essential DocIO is a non-UI component that provides a full-fledged document object model to manipulate the Word document contents. Hence it is not feasible to get the page count directly from Word document using DocIO.
We have prepared the sample application to get the page counts from the word document and it can be downloaded from the below link.
https://www.syncfusion.com/downloads/support/directtrac/general/ze/GetPageCount16494613

PDFsharp - overlay page from other PDF

I'm generating PDF files using PDFsharp, and I need to overlay the PDF I'm generating with a specific page from another PDF.
I've created this method:
private void ApplyOverlay(XGraphics graph, string overlaypdfPath, int pageNumberInOverlay, XRect coordinates)
{
var xPdf = XPdfForm.FromFile(overlaypdfPath);
if(xPdf.PageCount < pageNumberInOverlay)
throw new Exception("not enough pages");
//Here i need to take from xPdf just the page number -> pageNumberInOverlay
graph.DrawImage(xPdfPageN, coordinates);
}
I don’t know how to select only a specific page.

You can append the page number to the name of the PDF file, separated with a hash sign ("#").
To get page 7 of "sample.pdf", use the filename "sample.pdf#6" (zero-based page numbers).

Extract pages from a PDF file using ITextSharp

Is it possible using IText to copy PDF pages from a full PDF document and return partial document based on a form field name? For example I need to copy the beginning of a pdf document and stop at a certain text field called [STOP_HERE], so whatever contents before this fields need to be extracted, the [STOP_HERE] field could be located on a different page for each document, so using page numbers wouldn't help here.
I searched online and all I can find is a way to copy only form fields from a document but not the whole document elements including images texts with their exact location and style.
Can IText do the job here?
EDIT: More details
[STOP_HERE] is an AcroForms text field which has been placed in a document by the PDF design person to indicate that everything before this element should be copied as is into a different document. The field itself is not important, I don't want to fill or do anything with it, it's just used as a signal to let the document parser stop there and copy all previous (upper) contents, I just don't know how to read all contents (without changing style, contents, etc) before this field.

Is it possible using IText to copy PDF pages from a full PDF document and return partial document based on a form field name? For example I need to copy the beginning of a pdf document and stop at a certain text field called [STOP_HERE]
Unfortunately the OP didn't tell whether the page containing the form field [STOP_HERE] is to be included or not. As that is a mere +/-1 matter, though, I simply assumed the page is to be included.
Thus, the task can be implemented like this:
PdfReader reader = new PdfReader(srcFile);
AcroFields.Item field = reader.AcroFields.Fields["[STOP_HERE]"];
if (field != null)
{
int firstPage = reader.NumberOfPages + 1;
for (int index = 0; index < field.Size; index++)
{
int page = field.GetPage(index);
if (page > 0 && page < firstPage)
firstPage = page;
}
if (firstPage <= reader.NumberOfPages)
{
reader.SelectPages("1-" + firstPage);
PdfStamper stamper = new PdfStamper(reader, new FileStream(dstFile, FileMode.Create, FileAccess.Write));
stamper.Close();
}
}
reader.Close();
The code opens the source file in a PdfReader and first looks for the field. If it exists, it iterates over all appearances of that field and determines the earliest page with an appearance of the field. If there is such a page, the code restricts the reader to the pages up to that page and stores this restriction using a PdfStamper.

Open PDF file in a specific page using pdfbox

I have this program that makes a search, for example a sentence, in all pdf files of a folder.
It's working perfect...
But I would like to add a feature to open in the exact page of that sentence.
And I look through the documentation of pdfbox and I could not find anything that was specific for this.
I don't know if I let something pass by, but if somebody could enlighten me in this I would be very grateful
Thank you

I read your question earlier this week. At the time, I didn't have an answer for you. Then I stumbled on the methods setStartPage() and setEndPage() on the PDFBox documentation for the PDFTextStripper class and it made me think of your question and this answer. It's been about 4 months since you asked the question, but maybe this will help someone. I know I learned a thing or two while writing it.
When you search a PDF file, you can search a range of pages. The functions setStartPage() and setEndPage() set the range of pages you are searching. If we set the start and end page to the same page number, then we will know which page the search term was found on.
In the code below, I am using a windows forms application but you can adapt my code to fit your application.
using System;
using System.Windows.Forms;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
//The Diagnostics namespace is needed to specify PDF open parameters. More on them later.
using System.Diagnostics;
//specify the string you are searching for
string searchTerm = "golden";
//I am using a static file path
string pdfFilePath = #"F:\myFile.pdf";
//load the document
PDDocument document = PDDocument.load(pdfFilePath);
//get the number of pages
int numberOfPages = document.getNumberOfPages();
//create an instance of text stripper to get text from pdf document
PDFTextStripper stripper = new PDFTextStripper();
//loop through all the pages. We will search page by page
for (int pageNumber = 1; pageNumber <= numberOfPages; pageNumber++)
{
//set the start page
stripper.setStartPage(pageNumber);
//set the end page
stripper.setEndPage(pageNumber);
//get the text from the page range we set above.
//in this case we are searching one page.
//I used the ToLower method to make all the text lowercase
string pdfText = stripper.getText(document).ToLower();
//just for fun, display the text on each page in a messagebox. My pdf file only has two pages. But this might be annoying to you if you have more.
MessageBox.Show(pdfText);
//search the pdfText for the search term
if (pdfText.Contains(searchTerm))
{
//just for fun, display the page number on which we found the search term
MessageBox.Show("Found the search term on page " + pageNumber);
//create a process. We will be opening the pdf document to a specific page number
Process myProcess = new Process();
//I specified Adobe Acrobat as the program to open
myProcess.StartInfo.FileName = "Acrobat.exe";
//see link below for info on PDF document open parameters
myProcess.StartInfo.Arguments = "/A \"page=" + pageNumber + "=OpenActions\"" + pdfFilePath;
//Start the process
myProcess.Start();
//break out of the loop. we found our search term and we opened the PDF file
break;
}
}
//close the document we opened.
document.close();
Check out this Adobe pdf document on setting opening parameters of the PDF file:
http://partners.adobe.com/public/developer/en/acrobat/PDFOpenParameters.pdf

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.