Open PDF file in a specific page using pdfbox

I have a program that searches for text, for example a sentence, in all the PDF files in a folder.
It works perfectly.
Now I would like to add a feature that opens the PDF at the exact page where the sentence was found.
I looked through the PDFBox documentation and could not find anything specific to this.
I may have missed something, so if somebody could enlighten me I would be very grateful.
Thank you.

I read your question earlier this week. At the time, I didn't have an answer for you. Then I stumbled on the methods setStartPage() and setEndPage() in the PDFBox documentation for the PDFTextStripper class, which made me think of your question and led to this answer. It's been about 4 months since you asked, but maybe this will help someone. I know I learned a thing or two while writing it.
When you extract text from a PDF file with PDFTextStripper, you can restrict the extraction to a range of pages. The methods setStartPage() and setEndPage() set that range, and the page numbers are 1-based. If we set the start and end page to the same page number, we extract one page at a time and therefore know which page the search term was found on.
In the code below I am using a Windows Forms application, but you can adapt the code to fit your application.
using System;
using System.Windows.Forms;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
//The Diagnostics namespace is needed to specify PDF open parameters. More on them later.
using System.Diagnostics;
//specify the string you are searching for
string searchTerm = "golden";
//I am using a static file path
string pdfFilePath = @"F:\myFile.pdf";
//load the document
PDDocument document = PDDocument.load(pdfFilePath);
//get the number of pages
int numberOfPages = document.getNumberOfPages();
//create an instance of text stripper to get text from pdf document
PDFTextStripper stripper = new PDFTextStripper();
//loop through all the pages. We will search page by page
for (int pageNumber = 1; pageNumber <= numberOfPages; pageNumber++)
{
    //set the start page
    stripper.setStartPage(pageNumber);
    //set the end page
    stripper.setEndPage(pageNumber);
    //get the text from the page range we set above.
    //in this case we are searching one page.
    //I used the ToLower method to make all the text lowercase,
    //so the search term must be lowercase as well
    string pdfText = stripper.getText(document).ToLower();
    //just for fun, display the text on each page in a messagebox. My pdf file only has two pages. But this might be annoying to you if you have more.
    MessageBox.Show(pdfText);
    //search the pdfText for the search term
    if (pdfText.Contains(searchTerm))
    {
        //just for fun, display the page number on which we found the search term
        MessageBox.Show("Found the search term on page " + pageNumber);
        //create a process. We will be opening the pdf document to a specific page number
        Process myProcess = new Process();
        //I specified Adobe Acrobat as the program to open
        myProcess.StartInfo.FileName = "Acrobat.exe";
        //see link below for info on PDF document open parameters.
        //the file path is a separate argument, so it needs a leading space,
        //and it is quoted in case the path contains spaces
        myProcess.StartInfo.Arguments = "/A \"page=" + pageNumber + "=OpenActions\" \"" + pdfFilePath + "\"";
        //Start the process
        myProcess.Start();
        //break out of the loop. we found our search term and we opened the PDF file
        break;
    }
}
//close the document we opened.
document.close();
Check out this Adobe pdf document on setting opening parameters of the PDF file:
http://partners.adobe.com/public/developer/en/acrobat/PDFOpenParameters.pdf
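The argument string is the part that is easiest to get wrong, so it can help to build it in one place. Below is a minimal sketch, assuming the same `/A "page=N=OpenActions"` form used in the answer and that `Acrobat.exe` can be resolved on the machine. The class and method names (`PdfLauncher`, `BuildArguments`, `Open`) are hypothetical and not part of PDFBox or Acrobat.

```csharp
using System;
using System.Diagnostics;

public static class PdfLauncher
{
    // Builds the command-line arguments for opening a PDF at a 1-based page.
    // The path is wrapped in quotes so a path containing spaces stays one argument.
    public static string BuildArguments(string pdfFilePath, int pageNumber)
    {
        if (pageNumber < 1)
            throw new ArgumentOutOfRangeException(nameof(pageNumber));
        return "/A \"page=" + pageNumber + "=OpenActions\" \"" + pdfFilePath + "\"";
    }

    // Starts the viewer. Assumption: "Acrobat.exe" is registered or on the PATH.
    public static Process Open(string pdfFilePath, int pageNumber)
    {
        var startInfo = new ProcessStartInfo("Acrobat.exe", BuildArguments(pdfFilePath, pageNumber));
        return Process.Start(startInfo);
    }
}
```

Inside the search loop, `PdfLauncher.Open(pdfFilePath, pageNumber)` would then replace the manual `Process` setup.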

Related

Split Doc file Pages And Convert To PDF with gembox Document

I want to search each page of a Word document for a specific word and save the pages to PDF based on where that word occurs (the content that belongs to a word may span one page or more).
For example, suppose a file has three pages, with one special word on the first page and the next special word on the third page. I want to save the first and second pages as one PDF and the third page as a separate PDF. Each PDF file should be named after the special word found in it.
My problem is that I don't know how to loop over the pages, read each page's content to find the special word, and save the pages as a PDF.
Thank you.

Here is how you can do it:
1. Paginate your Word document using the DocumentModel.GetPaginator method.
2. Read the text content of each page using the FrameworkElement.ToText extension method.
3. Save the selected pages to PDF using the DocumentModelPage.Save method.
In other words, try the following:
string search = "Your Specific Word";
string inputPath = "input.docx";
// Load Word document.
var document = DocumentModel.Load(inputPath);
// 1. Get document's pages.
var pages = document.GetPaginator().Pages;
for (int i = 0, count = pages.Count; i < count; ++i)
{
    // 2. Read page's text content.
    DocumentModelPage page = pages[i];
    string pageTextContent = page.PageContent.ToText();
    // 3. Save page as PDF.
    if (pageTextContent.Contains(search))
    {
        string outputPath = $"{search}_{i}.pdf";
        page.Save(outputPath);
    }
}

Find then save web page to Drive Using C#

I have a problem: I want to find a specific string in a web page and then save the page in which I found the string.
I am using Firefox as the web browser.
Problem:
1. I open a page (containing a random word).
2. My C# program then searches the page. If the word is found, the program automatically saves the page to the drive. If not, the program clicks the Next button on the page and searches again.
Is that possible?

Ok, so it sounds like you might want to do something like the following.
You can use WebClient to load the response from a URL into a string:
using (WebClient client = new WebClient())
{
    string s = client.DownloadString(your_url);
}
You can then search for an occurrence of the string you are looking for in "s" using IndexOf:
if (s.IndexOf("string you are searching for") > -1)
{
    // s contains "string you are searching for"
}
Then you can save "s" to disk using a StreamWriter:
using (StreamWriter sw = new StreamWriter("file name"))
{
    sw.WriteLine(s);
}
As for clicking the "Next" button: if you know the URLs of the pages in advance, you can define them as a list of strings and iterate over them, running the code above for each one.
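The three snippets can be combined into one loop over a list of URLs. The following is a rough sketch under a few assumptions: the URLs are known up front, the match is case-insensitive (the snippets above are case-sensitive), and the download step is passed in as a delegate so that it can be swapped out. The names `PageSearch` and `TryFindPage` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class PageSearch
{
    // Downloads each URL in order and stops at the first page whose content
    // contains the term. Returns false when no page matches.
    public static bool TryFindPage(
        IEnumerable<string> urls,
        string term,
        Func<string, string> download,
        out string matchedUrl,
        out string content)
    {
        foreach (string url in urls)
        {
            string body = download(url);
            // Assumption: ordinal, case-insensitive matching is acceptable.
            if (body.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matchedUrl = url;
                content = body;
                return true;
            }
        }
        matchedUrl = null;
        content = null;
        return false;
    }
}
```

With a WebClient, the delegate can simply be `client.DownloadString`, and a successful match can be written to disk with `File.WriteAllText(path, content)`.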

iTextSharp GetTextFromPage Only Returns First Page

I am using iTextSharp Version 5.5.12
The code knows there are 10 pages in my pdf. In my loop, only the first page is returned.
PdfReader Pdf = new PdfReader(PATH_TO_PDF);
for (int intPageNum = 1; intPageNum <= Pdf.NumberOfPages; intPageNum++)
{
    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
    string strPageText = PdfTextExtractor.GetTextFromPage(Pdf, intPageNum, strategy);
}
As I step through all ten iterations of the loop, only on the first iteration does strPageText have any text in it.
Any thoughts on what I am doing wrong?
Thanks in advance.

The "problem" appears to be a scanning software setting that combines multiple pdf files into one document (file).
Image Capture Plus software has a Job Setting, on the File tab, under OCR Settings for Searchable PDF. Make sure it is set to "All Pages".

PdfTextExtractor.GetTextFromPage suddenly giving empty string

We've been using the iTextSharp libraries for a couple of years now within an SSIS process to read some values out of a set of PDF exam documents. Everything has been running nicely until this week when suddenly we are getting the return of an empty string when calling the PdfTextExtractor.GetTextFromPage method. I'll include the code here:
// Read the data from the blob column where the PDF exists
byte[] byteBuffer = Row.FileData.GetBlobData(0, (int)Row.FileData.Length);
using (var pdfReader = new PdfReader(byteBuffer))
{
    // Here is the important stuff
    var extractStrategy = new LocationTextExtractionStrategy();
    // This call will extract the page with the proper data on it depending on the exam type
    // 1-page exams = NBOME - need to read first page for exam result data
    // 2-page exams = NBME - need to read second page for exam result data
    // The next two statements utilize this construct.
    var vendor = pdfReader.NumberOfPages == 1 ? "NBOME" : "NBME";
    // *** THIS NEXT LINE GIVES THE EMPTY STRING
    var newText = PdfTextExtractor.GetTextFromPage(pdfReader, pdfReader.NumberOfPages == 1 ? 1 : 2, extractStrategy);
    var stringList = newText.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
    var fileParser = FileParseFactory.GetFileParse(stringList, vendor);
    // Populate our output variables
    Row.ParsedExamName = fileParser.GetExamName(stringList);
    Row.DateParsed = DateTime.Now;
    Row.ParsedId = fileParser.GetStudentId(stringList);
    Row.ParsedTestDate = fileParser.GetTestDate(stringList);
    Row.ParsedTestDateString = fileParser.GetTestDateAsString(stringList);
    Row.ParsedName = fileParser.GetStudentName(stringList);
    Row.ParsedTotalScore = fileParser.GetTestScore(stringList);
    Row.ParsedVendor = vendor;
}
This is not for all PDFs, by the way. To explain more, we are reading in exam files. One of the exam types (NBME) seems to be reading just fine. However, the other type (NBOME) is not. However, prior to this week, the NBOME ones were being read fine.
This leads me to think it is an internal format change of the PDF file itself.
Also, another bit of information is that the actual pdfReader has data - I can get a byte[] array of the data - but the call to get any text is simply giving me empty.
I'm sorry I'm not able to show any exam data or files - that information is sensitive.
Has anybody seen something like this? If so, any possible solutions?

Well - we have found our answer. The user was originally going to the NBOME web site and downloading the PDF exam result files to import into my parsing system. Like I said, this worked for quite some time. Recently (this week), however, the user stopped downloading the files and instead used a PDF printing feature to print the PDF files to PDF. When she did that, the problem occurred.
Bottom line: printing the PDF to PDF may have injected some characters or otherwise changed something under the covers, so that reading the file via iTextSharp did not fail but returned an empty string. She should have simply continued downloading the files directly.
Thanks to those who offered some comments!

Extract pages from a PDF file using ITextSharp

Is it possible using IText to copy PDF pages from a full PDF document and return partial document based on a form field name? For example I need to copy the beginning of a pdf document and stop at a certain text field called [STOP_HERE], so whatever contents before this fields need to be extracted, the [STOP_HERE] field could be located on a different page for each document, so using page numbers wouldn't help here.
I searched online, and all I can find is a way to copy only the form fields from a document, not all of the document's elements, including images and text, with their exact location and style.
Can IText do the job here?
EDIT: More details
[STOP_HERE] is an AcroForms text field which has been placed in a document by the PDF design person to indicate that everything before this element should be copied as is into a different document. The field itself is not important, I don't want to fill or do anything with it, it's just used as a signal to let the document parser stop there and copy all previous (upper) contents, I just don't know how to read all contents (without changing style, contents, etc) before this field.

Is it possible using IText to copy PDF pages from a full PDF document and return partial document based on a form field name? For example I need to copy the beginning of a pdf document and stop at a certain text field called [STOP_HERE]
Unfortunately the OP didn't tell whether the page containing the form field [STOP_HERE] is to be included or not. As that is a mere +/-1 matter, though, I simply assumed the page is to be included.
Thus, the task can be implemented like this:
PdfReader reader = new PdfReader(srcFile);
AcroFields.Item field = reader.AcroFields.Fields["[STOP_HERE]"];
if (field != null)
{
    int firstPage = reader.NumberOfPages + 1;
    for (int index = 0; index < field.Size; index++)
    {
        int page = field.GetPage(index);
        if (page > 0 && page < firstPage)
            firstPage = page;
    }
    if (firstPage <= reader.NumberOfPages)
    {
        reader.SelectPages("1-" + firstPage);
        PdfStamper stamper = new PdfStamper(reader, new FileStream(dstFile, FileMode.Create, FileAccess.Write));
        stamper.Close();
    }
}
reader.Close();
The code opens the source file in a PdfReader and first looks for the field. If it exists, it iterates over all appearances of that field and determines the earliest page with an appearance of the field. If there is such a page, the code restricts the reader to the pages up to that page and stores this restriction using a PdfStamper.
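Since the question leaves open whether the page containing [STOP_HERE] should be included, the +/-1 decision can be isolated in a small helper that builds the range string passed to SelectPages. This is only a sketch; the names `StopFieldRange` and `Build` are hypothetical and not part of iTextSharp.

```csharp
public static class StopFieldRange
{
    // Returns a page range such as "1-3" for PdfReader.SelectPages,
    // or null when no page would remain (for example, the field is on page 1
    // and that page is to be excluded).
    public static string Build(int stopPage, bool includeStopPage)
    {
        int lastPage = includeStopPage ? stopPage : stopPage - 1;
        return lastPage >= 1 ? "1-" + lastPage : null;
    }
}
```

In the code above, `"1-" + firstPage` corresponds to `StopFieldRange.Build(firstPage, true)`. When excluding the stop page, check for a null result before calling SelectPages.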
