iTextSharp GetTextFromPage Only Returns First Page

iTextSharp GetTextFromPage Only Returns First Page - c#

I am using iTextSharp Version 5.5.12
The code knows there are 10 pages in my pdf. In my loop, only the first page is returned.
PdfReader Pdf = new PdfReader(PATH_TO_PDF);
for (intPageNum = 1; intPageNum <= Pdf.NumberOfPages; intPageNum++)
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string strPageText = PdfTextExtractor.GetTextFromPage(Pdf, intPageNum, strategy);
}
As I step through all ten iterations of the loop, only on the first iteration does strPageText have any text in it.
Any thoughts on what I am doing wrong?
Thanks in advance.

The "problem" appears to be a scanning software setting that combines multiple pdf files into one document (file).
Image Capture Plus software has a Job Setting, on the File tab, under OCR Settings for Searchable PDF. Make sure it is set to "All Pages".

Related

Exporting WPF Canvas to PDF

I've been attempting to find an easy solution to exporting a Canvas in my WPF Application to a PDF Document.
So far, the best solution has been to use the PrintDialog and set it up to automatically use the Microsoft Print the PDF 'printer'. The only problem I have had with this is that although the PrintDialog is skipped, there is a FileDialog to choose where the file should be saved.
Sadly, this is a deal-breaker because I would like to run this over a large number of canvases with automatically generated PDF names (well, programitically provided anyway).
Other solutions I have looked at include:
Using PrintDocument, but from my experimentation I would have to manually iterate through all my Canveses children and manually invoke the correct Draw method (of which a lot of my custom elements with transformation would be rather time consuming to do)
Exporting as a PNG image and then embedding that in a PDF. Although this works, TextBlocks within my canvas are no longer text. So this isn't an ideal situation.
Using the 3rd party library PDFSharp has the same downfall as the PrintDocument. A lot of custom logic for each element.
With PDFSharp. I did find a method fir generating the XGraphics from a Canvas but no way of then consuming that object to make a PDF Page
So does anybody know how I can skip or automate the PDF PrintDialog, or consume PDFSharp XGraphics to make
A page. Or any other ideas for directions to take this besides writing a whole library to convert each of my Canvas elements to PDF elements.

If you look at the output port of a recent windows installation of Microsoft Print To PDF
You may note it is set to PORTPROMP: and that is exactly what causes the request for a filename.
You might note lower down, I have several ports set to a filename, and the fourth one down is called "My Print to PDF"
So very last century methodology; when I print with a duplicate printer but give it a different name I can use different page ratios etc., without altering the built in standard one. The output for a file will naturally be built:-
A) Exactly in one repeatable location, that I can file monitor and rename it, based on the source calling the print sequence, such that if it is my current default printer I can right click files to print to a known \folder\file.pdf
B) The same port can be used via certain /pt (printto) command combinations to output, not just to that default port location, but to a given folder\name such as
"%ProgramFiles%\Windows NT\Accessories\WORDPAD.EXE" /pt listIN.doc "My Print to PDF" "My Print to PDF" "listOUT.pdf"
Other drivers usually charge for the convenience of WPF programmable renaming, but I will leave you that PrintVisual challenge for another of your three wishes.
MS suggest XPS is best But then they would be promoting it as a PDF competitor.
It does not need to be Doc[X]2PDF it could be [O]XPS2PDF or aPNG2PDF or many pages TIFF2PDF etc. etc. Any of those are Native to Win 10 also other 3rd party apps such as [Free]Office with a PrintTo verb will do XLS[X]2PDF. Imagination becomes pagination.

I had a great success in generating PDFs using PDFSharp in combination with SkiaSharp (for more advanced graphics).
Let me begin from the very end:
you save the PdfDocument object in the following way:
PdfDocument yourDocument = ...;
string filename = #"your\file\path\document.pdf"
yourDocument.Save(filename);
creating the PdfDocument with a page can be achieved the following way (adjust the parameters to fit your needs):
PdfDocument yourDocument = new PdfDocument();
yourDocument.PageLayout = PdfPageLayout.SinglePage;
yourDocument.Info.Title = "Your document title";
PdfPage yourPage = yourDocument.AddPage();
yourDocument.Orientation = PageOrientation.Landscape;
yourDocument.Size = PageSize.A4;
the PdfPage object's content (as an example I'm putting a string and an image) is filled in the following way:
using (XGraphics gfx = XGraphics.FromPdfPage(yourPage))
{
XFont yourFont = new XFont("Helvetica", 20, XFontStyle.Bold);
gfx.DrawString(
"Your string in the page",
yourFont,
XBrushes.Black,
new XRect(0, XUnit.FromMillimeter(10), page.Width, yourFont.GetHeight()),
XStringFormats.Center);
using (Stream s = new FileStream(#"path\to\your\image.png", FileMode.Open))
{
XImage image = XImage.FromStream(s);
var imageRect = new XRect()
{
Location = new XPoint() { X = XUnit.FromMillimeter(42), Y = XUnit.FromMillimeter(42) },
Size = new XSize() { Width = XUnit.FromMillimeter(42), Height = XUnit.FromMillimeter(42.0 * image.PixelHeight / image.PixelWidth) }
};
gfx.DrawImage(image, imageRect);
}
}
Of course, the font objects can be created as static members of your class.
And this is, in short to answer your question, how you consume the XGraphics object to create a PDF page.
Let me know if you need more assistance.

Exporting RDLC on letter head

I have a requirement to export RDLC report into PDF, it should also contain letter head of the company in background. Problem I see is that RDLC has a header, body and footer, how do we apply common image background? Any idea to this issue?

(Posted as an answer, since it's too long for a comment.)
I don't know of a method to add a page background directly in an RDLC. However, we had a similar issue with our report generator (MS Access, not RDLC), and solved it by (1) creating the PDF without letterhead and then (2) using PDFSharp to merge the resulting PDF with a letterhead ("background") PDF. Something like this might work for your use case as well.
We use the following code:
public static void AddBackground(string source, string background, string result)
{
using (var formBackground = XPdfForm.FromFile(background))
using (var pdf = PdfReader.Open(source, PdfDocumentOpenMode.Modify))
{
foreach (var page in pdf.Pages.Cast<PdfPage>())
{
var xg = XGraphics.FromPdfPage(page, XGraphicsPdfPageOptions.Prepend);
xg.DrawImage(formBackground, 0, 0);
if (formBackground.PageIndex < formBackground.PageCount - 1)
{
formBackground.PageIndex += 1;
}
}
pdf.Save(result);
}
}
All parameters are paths to the respective PDF files. If the background PDF has less pages than the source PDF, then the last page of the background PDF is added to all remaining source PDF pages. It is useful if your first page has a different background than all remaining pages, you just need a 2-page background PDF for that.

getting existing pdf as a byte array then rotate it as A4 Landscape in c#

I have a pdf which produced by SSRS. I need to get this pdf as a byte array then save whole pdf as a A4.Landscape.
I try ;
string say ="hello world";
byte [] pdfArr=Encoding.UTF8.GetBytes(say)
var doc = new Document(iTextSharp.text.PageSize.A4_Landscape.Rotate());
string path = Environment.CurrentDirectory;
PdfWriter.GetInstance(doc, new FileStream(path,"/pdfdoc.pdf",FileMode.Create));
doc.Open();
doc.Add(new Paragraph(Encoding.UTF8.GetString(pdfArr)));
doc.Close();
Process.Start(path+"/pdfdoc.pdf");
When I create new pdf by iTextsharp the above code works fine but when I try for the SSRS pdf, the pdf's inside fills with meaningless characters.
Also I know that, I can read and rotate page by page via PDFReader but I don't want to read the pages. Because, the reports table is too long so it divides into pages, I don't know how many pages should involved for one table, so my main aim is showing them in horizantal (landscape) as one table.
Any suggestions or code pieces are welcomed.
Thanks anyway..
Edit : As I explained in above paragraph, I can't take pages with pdfReader or something else because I don't want to change every page as landscape and I can't. It doesn't serve my aim. I just wat to create pdf as a landscape so all the loıng tables anda datas can seen in one page.

Extract pages from a PDF file using ITextSharp

Is it possible using IText to copy PDF pages from a full PDF document and return partial document based on a form field name? For example I need to copy the beginning of a pdf document and stop at a certain text field called [STOP_HERE], so whatever contents before this fields need to be extracted, the [STOP_HERE] field could be located on a different page for each document, so using page numbers wouldn't help here.
I searched online and all I can find is a way to copy only form fields from a document but not the whole document elements including images texts with their exact location and style.
Can IText do the job here?
EDIT: More details
[STOP_HERE] is an AcroForms text field which has been placed in a document by the PDF design person to indicate that everything before this element should be copied as is into a different document. The field itself is not important, I don't want to fill or do anything with it, it's just used as a signal to let the document parser stop there and copy all previous (upper) contents, I just don't know how to read all contents (without changing style, contents, etc) before this field.

Is it possible using IText to copy PDF pages from a full PDF document and return partial document based on a form field name? For example I need to copy the beginning of a pdf document and stop at a certain text field called [STOP_HERE]
Unfortunately the OP didn't tell whether the page containing the form field [STOP_HERE] is to be included or not. As that is a mere +/-1 matter, though, I simply assumed the page is to be included.
Thus, the task can be implemented like this:
PdfReader reader = new PdfReader(srcFile);
AcroFields.Item field = reader.AcroFields.Fields["[STOP_HERE]"];
if (field != null)
{
int firstPage = reader.NumberOfPages + 1;
for (int index = 0; index < field.Size; index++)
{
int page = field.GetPage(index);
if (page > 0 && page < firstPage)
firstPage = page;
}
if (firstPage <= reader.NumberOfPages)
{
reader.SelectPages("1-" + firstPage);
PdfStamper stamper = new PdfStamper(reader, new FileStream(dstFile, FileMode.Create, FileAccess.Write));
stamper.Close();
}
}
reader.Close();
The code opens the source file in a PdfReader and first looks for the field. If it exists, it iterates over all appearances of that field and determines the earliest page with an appearance of the field. If there is such a page, the code restricts the reader to the pages up to that page and stores this restriction using a PdfStamper.

Open PDF file in a specific page using pdfbox

I have this program that makes a search, for example a sentence, in all pdf files of a folder.
It's working perfect...
But I would like to add a feature to open in the exact page of that sentence.
And I look through the documentation of pdfbox and I could not find anything that was specific for this.
I don't know if I let something pass by, but if somebody could enlighten me in this I would be very grateful
Thank you

I read your question earlier this week. At the time, I didn't have an answer for you. Then I stumbled on the methods setStartPage() and setEndPage() on the PDFBox documentation for the PDFTextStripper class and it made me think of your question and this answer. It's been about 4 months since you asked the question, but maybe this will help someone. I know I learned a thing or two while writing it.
When you search a PDF file, you can search a range of pages. The functions setStartPage() and setEndPage() set the range of pages you are searching. If we set the start and end page to the same page number, then we will know which page the search term was found on.
In the code below, I am using a windows forms application but you can adapt my code to fit your application.
using System;
using System.Windows.Forms;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
//The Diagnostics namespace is needed to specify PDF open parameters. More on them later.
using System.Diagnostics;
//specify the string you are searching for
string searchTerm = "golden";
//I am using a static file path
string pdfFilePath = #"F:\myFile.pdf";
//load the document
PDDocument document = PDDocument.load(pdfFilePath);
//get the number of pages
int numberOfPages = document.getNumberOfPages();
//create an instance of text stripper to get text from pdf document
PDFTextStripper stripper = new PDFTextStripper();
//loop through all the pages. We will search page by page
for (int pageNumber = 1; pageNumber <= numberOfPages; pageNumber++)
{
//set the start page
stripper.setStartPage(pageNumber);
//set the end page
stripper.setEndPage(pageNumber);
//get the text from the page range we set above.
//in this case we are searching one page.
//I used the ToLower method to make all the text lowercase
string pdfText = stripper.getText(document).ToLower();
//just for fun, display the text on each page in a messagebox. My pdf file only has two pages. But this might be annoying to you if you have more.
MessageBox.Show(pdfText);
//search the pdfText for the search term
if (pdfText.Contains(searchTerm))
{
//just for fun, display the page number on which we found the search term
MessageBox.Show("Found the search term on page " + pageNumber);
//create a process. We will be opening the pdf document to a specific page number
Process myProcess = new Process();
//I specified Adobe Acrobat as the program to open
myProcess.StartInfo.FileName = "Acrobat.exe";
//see link below for info on PDF document open parameters
myProcess.StartInfo.Arguments = "/A \"page=" + pageNumber + "=OpenActions\"" + pdfFilePath;
//Start the process
myProcess.Start();
//break out of the loop. we found our search term and we opened the PDF file
break;
}
}
//close the document we opened.
document.close();
Check out this Adobe pdf document on setting opening parameters of the PDF file:
http://partners.adobe.com/public/developer/en/acrobat/PDFOpenParameters.pdf

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.