Extract pages from a PDF file using ITextSharp

Extract pages from a PDF file using ITextSharp - c#

Is it possible using IText to copy PDF pages from a full PDF document and return partial document based on a form field name? For example I need to copy the beginning of a pdf document and stop at a certain text field called [STOP_HERE], so whatever contents before this fields need to be extracted, the [STOP_HERE] field could be located on a different page for each document, so using page numbers wouldn't help here.
I searched online and all I can find is a way to copy only form fields from a document but not the whole document elements including images texts with their exact location and style.
Can IText do the job here?
EDIT: More details
[STOP_HERE] is an AcroForms text field which has been placed in a document by the PDF design person to indicate that everything before this element should be copied as is into a different document. The field itself is not important, I don't want to fill or do anything with it, it's just used as a signal to let the document parser stop there and copy all previous (upper) contents, I just don't know how to read all contents (without changing style, contents, etc) before this field.

Is it possible using IText to copy PDF pages from a full PDF document and return partial document based on a form field name? For example I need to copy the beginning of a pdf document and stop at a certain text field called [STOP_HERE]
Unfortunately the OP didn't tell whether the page containing the form field [STOP_HERE] is to be included or not. As that is a mere +/-1 matter, though, I simply assumed the page is to be included.
Thus, the task can be implemented like this:
PdfReader reader = new PdfReader(srcFile);
AcroFields.Item field = reader.AcroFields.Fields["[STOP_HERE]"];
if (field != null)
{
int firstPage = reader.NumberOfPages + 1;
for (int index = 0; index < field.Size; index++)
{
int page = field.GetPage(index);
if (page > 0 && page < firstPage)
firstPage = page;
}
if (firstPage <= reader.NumberOfPages)
{
reader.SelectPages("1-" + firstPage);
PdfStamper stamper = new PdfStamper(reader, new FileStream(dstFile, FileMode.Create, FileAccess.Write));
stamper.Close();
}
}
reader.Close();
The code opens the source file in a PdfReader and first looks for the field. If it exists, it iterates over all appearances of that field and determines the earliest page with an appearance of the field. If there is such a page, the code restricts the reader to the pages up to that page and stores this restriction using a PdfStamper.

Related

Calculate number of pages in docs file by dot net and C# and Syncfusion.DocIO.Net.Core Package

I have a .net Core 3.1 webapi.
I want to get the number of pages are in docs and doc file.
Am using the Syncfusion.DocIO.Net.Core package to perform operation on docs/doc file. But it does't provide feature to update stat of file and display only the PageCount: document.BuiltinDocumentProperties.PageCount;
This is not updated by files. Would you someone suggest me how i can calculate arithmetically.

You can get the page count from BuiltInProperties of the document using DocIO as below. It shows the page count in the document while creating the document using Microsoft Word application and also it still returns the same page count if we manipulate the document using DoIO.
int count = document.BuiltinDocumentProperties.PageCount;
If you want to get the page count, after manipulating the Word document using DocIO, we suggest you to convert the word document to PDF, and then you can retrieve the page count as like below.
WordDocument wordDocument = new WordDocument(fileStream, FormatType.Docx);
DocIORenderer render = new DocIORenderer();
//Sets Chart rendering Options.
render.Settings.ChartRenderingOptions.ImageFormat = ExportImageFormat.Jpeg;
//Converts Word document into PDF document
PdfDocument pdfDocument = render.ConvertToPDF(wordDocument);
int pageCount = pdfDocument.PageCount;
Since Word document is a flow document in which contents will not be preserved page by page; instead the contents will be preserved sequentially section by section. Each section may extend to various pages based on its contents like table, text, images etc.
Whereas Essential DocIO is a non-UI component that provides a full-fledged document object model to manipulate the Word document contents. Hence it is not feasible to get the page count directly from Word document using DocIO.
We have prepared the sample application to get the page counts from the word document and it can be downloaded from the below link.
https://www.syncfusion.com/downloads/support/directtrac/general/ze/GetPageCount16494613

Unable to read text in a specific location in a pdf file using iTextSharp

I'm given to read a pdf texts and do some stuffs are extracting the texts. I 'm using iTextSharp to read the PDF. The problem here is that the PdfTextExtractor.GetTextFromPage doesnt give me all the contents of the page. For ex
In the above PDF I m unable to read texts that are highlighted in blue. Rest of the characters I m able t read. Below is the line that does the above
`string filePath = "myFile path";
PdfReader pdfReader = new PdfReader(filePath);
for (int page = 1; page<=1; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
}`
Any suggestions here?
I have went through lots of queries and solution in SO but not specific to this query.

The reason for text extraction not extracting those texts is pretty simple: Those texts are not part of the static page content but form fields! But "Text extraction" in iText (and other PDF libraries I know, too) is considered to mean "extraction of the text of the static page content". Thus, those texts you miss simply are not subject to text extraction.
If you want to make form field values subject to your text extraction code, too, you first have to flatten the form field visualizations. "Flattening" here means making them part of the static page content and dropping all their form field dynamics.
You can do that by adding after reading the PDF in this line
PdfReader pdfReader = new PdfReader(filePath);
code to flatten this PDF and loading the flattened PDF into the pdfReader, e.g. like this:
MemoryStream memoryStream = new MemoryStream();
PdfStamper pdfStamper = new PdfStamper(pdfReader, memoryStream);
pdfStamper.FormFlattening = true;
pdfStamper.Writer.CloseStream = false;
pdfStamper.Close();
memoryStream.Position = 0;
pdfReader = new PdfReader(memoryStream);
Extracting the text from this re-initialized pdfReader will give you the text from the form fields, too.
Unfortunately, the flattened form text is added at the end of the content stream. As your chosen text extraction strategy SimpleTextExtractionStrategy simply returns the text in the order it is drawn, the former form fields contents all are extracted at the end.
You can change this by using a different text extraction strategy, i.e. by replacing this line:
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
Using the LocationTextExtractionStrategy (which is part of the iText distribution) already returns a better result; unfortunately the form field values are not exactly on the same base line as the static contents we perceive to be on the same line, so there are some unexpected line breaks.
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
Using the HorizontalTextExtractionStrategy (from this answer which contains both a Java and a C# version thereof) the result is even better. Beware, though, this strategy is not universally better, read the warnings in the answer text.
ITextExtractionStrategy strategy = new HorizontalTextExtractionStrategy();

Split Doc file Pages And Convert To PDF with gembox Document

I want to convert the entire content of that page to PDF by searching for a specific word on each page (which may be on one page or more).
For example, we have a file that has three pages, there is a special word on the first page, and the next special word on the third page. I want to save the PDF from the first to the second page and then save the third page separately. The PDF files will be named according to the specific word on that page.
My problem is that I don't know how to loop for each page and read the content of that page to get to the special word and save the pages as a PDF.
Thank You

Here is how you can do it.
Paginate your Word document using DocumentModel.GetPaginator method.
Read the text content of each page using FrameworkElement.ToText extension method.
Save selected pages to PDF using DocumentModelPage.Save method.
In other words, try the following:
string search = "Your Specific Word";
string inputPath = "input.docx";
// Load Word document.
var document = DocumentModel.Load(inputPath);
// 1. Get document's pages.
var pages = document.GetPaginator().Pages;
for (int i = 0, count = pages.Count; i < count; ++i)
{
// 2. Read page's text content.
DocumentModelPage page = pages[i];
string pageTextContent = page.PageContent.ToText();
// 3. Save page as PDF.
if (pageTextContent.Contains(search))
{
string outputPath = $"{search}_{i}.pdf";
page.Save(outputPath);
}
}

Appending PDFs to generated PDF using iTextSharp

I am generating a pdf using iTextSharp. If certain properties are true then I also want to insert an existing pdf with static content.
private byte[] GeneratePdf(DraftOrder draftOrder)
// create a pdf document
var document = new Document();
// set the page size, set the orientation
document.SetPageSize(PageSize.A4);
// create a writer instance
var pdfWriter = PdfWriter.GetInstance(document, new FileStream(file, FileMode.Create));
document.Open();
if(draftOrder.hasProperty){
//add these things to the pdf
var textToBeAdded = "<table><tr>....</table>";
}
FormatHtml(document, textToBeAdded , css);
if(someOtherProperty){
//add static pdf from file
document.NewPage();
var reader = new PdfReader("myPath/existing.pdf");
PdfImportedPage page;
for(var i = 0; i < reader.NumberOfPages; i++){
//It's this bit I don't really understand
//**how can I add the page read to the document being created?**
}
I can load the pdf from the source but when I iterate over the pages I can't seem to be able to add them to the document I am creating.
Cheers

Please read http://manning.com/lowagie2/samplechapter6.pdf
If you don't mind losing all interactivity, you can get the template from the writer object with the GetImportedPage() method and add it to the document with AddTemplate ().
This question has been answered many times on StackOverflow and you'll notice that I always warn about some dangers: you need to realize that the dimensions of the imported page can be different from the page size you initially defined. Because of this invisible parts of the imported page can become visible; visible parts can become invisible.
I'd prefer adding the extra page in a second ho using PdfCopy, but maybe that's just me.

iTextSharp: Split pages size equals file size

Here is how I split a large PDF (144 mb):
public int SplitAndSave(string inputPath, string outputPath)
{
FileInfo file = new FileInfo(inputPath);
string name = file.Name.Substring(0, file.Name.LastIndexOf("."));
using (PdfReader reader = new PdfReader(inputPath))
{
for (int pagenumber = 1; pagenumber <= reader.NumberOfPages; pagenumber++)
{
string filename = pagenumber.ToString() + ".pdf";
Document document = new Document();
PdfCopy copy = new PdfCopy(document, new FileStream(outputPath + "\\" + filename, FileMode.Create));
document.Open();
copy.AddPage(copy.GetImportedPage(reader, pagenumber));
document.Close();
}
return reader.NumberOfPages;
}
}
For most PDFs (small size, and I guess old format), all works fine. But for a bigger one (that perhaps are using something like refstreams for better compression), the split pages open as one page, but its size is equal to the original PDF's size. What can I do?

In case of your document Top_Gear_Magazine_2012_09.pdf the reason is indeed the one I mentioned: All pages refer to object 2 0 R as their /Resources, and the dictionary in 2 0 obj in turn references all images in the PDF.
To split that document into partial documents containing only the images required, you should preprocess the document by first finding out which images belong to which pages and then creating individual /Resources dictionaries for all pages.
As you already use iText in this context, you can also use it to find out which images are required for which pages. Use the iText parser package to initially parse the PDF page by page using a RenderListener implementation whose RenderImage method simply remembers which image objects are used on the current page. (As a special twist, iText hides the name of the image XObject in question; you get the indirect object, though, and can query its object and generation number which suffices for the next step.)
In a second step, you open the document in a PdfStamperand iterate over the pages. For each page you retrieve the /Resources dictionary and copy it, but only copy those XObjects references referencing one of the image objects whose object number and generation you remembered for the respective page during the first step. Finally you set the diminished copy as the /Resources dictionary of the page in question.
The resulting PDF should split just fine.
PS A very similar issue recently came up on the iText mailing list. In that thread the solution recipe given here has been improved, to get around the difficulties caused by iText hiding the xobject name, I now would propose to intervene before the name is lost by using a different ContentOperator for "Do", here the Java version:
class Do implements ContentOperator
{
public void invoke(PdfContentStreamProcessor processor, PdfLiteral operator, ArrayList<PdfObject> operands) throws IOException
{
PdfName xobjectName = (PdfName)operands.get(0);
names.add(xobjectName);
}
final List<PdfName> names = new ArrayList<PdfName>();
}
This content operator simply collects the names of the used xobjects, i.e. the xobject resources to keep for the given page.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.