iTextSharp v5 GetTextFromPage() throws IndexOutOfRangeException - c#

I'm trying to extract the textual content of a PDF with the following code:
    PdfReader reader = new PdfReader(path);
    string strText = string.Empty;
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        string s = PdfTextExtractor.GetTextFromPage(reader, page);
        strText += " " + s;
    }
    reader.Close();
NumberOfPages returns 257, but at page 227 GetTextFromPage() throws an IndexOutOfRangeException.
Any help is appreciated.
hofnarwillie

I resolved this issue by updating my version of iTextSharp from 5.1 to 5.2.

Related

How to Extract Text From a Landscape PDF File

I'm trying to extract the text from a landscape PDF file using iTextSharp. It works well for portrait pages but returns an empty string for landscape pages.
Here is my code:
    PdfReader reader = new PdfReader(pdfFile);
    int intPageNum = reader.NumberOfPages;
    var sb = new StringBuilder();
    for (int i = 1; i <= intPageNum; i++) {
        var text = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
        sb.Append(text + "\n");
    }

Search Multiple PDFs in a directory for a string in C# using itext7

I am trying to search for text in each PDF inside a directory using iText 7.
I managed to search one PDF using the code below. How can I make this work for each PDF in a directory?
    public List<int> ReadPdfFile(string fileName, String searchString)
    {
        List<int> pages = new List<int>();
        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                if (currentPageText.Contains(searchString))
                {
                    MessageBox.Show("Found COLLIN GRADY");
                }
                else
                {
                    MessageBox.Show("Could not find COLLIN GRADY");
                }
            }
            pdfReader.Close();
        }
        return pages;
    }
This works when called like this:
    ReadPdfFile("C:\\Users\\Billy\\Desktop\\All custom flyers\\ALBANY Ketchup Nov 2018 2.pdf", "COLLIN GRADY");
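One possible way to extend the single-file search to a whole directory is sketched below. It is only a sketch under assumptions: PdfDirectorySearch and SearchDirectory are hypothetical names, and the per-file search is passed in as a delegate so that the example does not depend on any iText types. To be useful with the ReadPdfFile method above, that method would also have to call pages.Add(page) when a match is found, since as posted it always returns an empty list.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of iText.
public static class PdfDirectorySearch
{
    // Runs searchOnePdf (for example, a ReadPdfFile-style method) on every
    // *.pdf file directly inside `directory` and keeps only the files that
    // reported at least one matching page.
    public static Dictionary<string, List<int>> SearchDirectory(
        string directory,
        string searchString,
        Func<string, string, List<int>> searchOnePdf)
    {
        var hits = new Dictionary<string, List<int>>();
        foreach (string file in Directory.EnumerateFiles(directory, "*.pdf"))
        {
            List<int> pages = searchOnePdf(file, searchString);
            if (pages != null && pages.Count > 0)
            {
                hits[file] = pages;
            }
        }
        return hits;
    }
}
```

A call could then look like PdfDirectorySearch.SearchDirectory(folder, "COLLIN GRADY", ReadPdfFile), where folder is the directory that holds the flyers. Reporting the results once at the end, rather than showing a MessageBox for every page, is probably more practical when many files are searched.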

Error on closing an empty iTextSharp document

I am successfully merging PDF documents. Now that I'm trying to implement error handling for the case where no PDF document has been selected, closing the document throws an error: The document has no pages
If no PDF document has been added in the foreach loop, do I still need to close the document, or not? If you open an object, it has to be closed at some point. So how do I exit correctly when no page has been added?
    private void MergePDFs()
    {
        DataSourceSelectArguments args = new DataSourceSelectArguments();
        DataView view = (DataView)SourceCertCockpit.Select(args);
        System.Data.DataTable table = view.ToTable();
        List<PdfReader> readerList = new List<PdfReader>();
        iTextSharp.text.Document document = new iTextSharp.text.Document();
        PdfCopy copy = new PdfCopy(document, Response.OutputStream);
        document.Open();
        int index = 0;
        foreach (DataRow myRow in table.Rows)
        {
            if (ListadoCertificadosCockpit.Rows[index].Cells[14].Text == "0")
            {
                PdfReader Reader = new PdfReader(Convert.ToString(myRow[0]));
                Chapter Chapter = new Chapter(Convert.ToString(Convert.ToInt32(myRow[1])), 0);
                Chapter.NumberDepth = 0;
                iTextSharp.text.Section Section = Chapter.AddSection(Convert.ToString(myRow[10]), 0);
                Section.NumberDepth = 0;
                iTextSharp.text.Section SubSection = Section.AddSection(Convert.ToString(myRow[7]), 0);
                SubSection.NumberDepth = 0;
                document.Add(Chapter);
                readerList.Add(Reader);
                for (int i = 1; i <= Reader.NumberOfPages; i++)
                {
                    copy.AddPage(copy.GetImportedPage(Reader, i));
                }
                Reader.Close();
            }
            index++;
        }
        if (document.PageNumber == 0)
        {
            document.Close();
            return;
        }
        document.Close();
        string SalesID = SALESID.Text;
        Response.ContentType = "application/pdf";
        Response.Cache.SetCacheability(HttpCacheability.NoCache);
        Response.AppendHeader("content-disposition", "attachment;filename=" + SalesID + ".pdf");
    }
In the old days, iText didn't throw an exception when you created a document and "forgot" to add any content. This resulted in a document with a single, blank page. This was considered a bug: people didn't like single-page, empty documents. Hence the design decision to throw an exception.
Something similar was done for newPage(). A new page can be triggered explicitly (when you add document.newPage() in your code) or implicitly (when the end of a page is reached). In the old days, this often resulted in unwanted blank pages. Hence the decision to ignore newPage() in case the current page is empty.
Suppose you have this:
    document.newPage();
    document.newPage();
One may expect that two new pages are created. That's not true. We've made a design decision to ignore the second document.newPage() because no content was added after the first document.newPage().
This brings us to the question: what if we want to insert a blank page? Or, in your case: what if it's OK to create a document with nothing more than a single blank page?
In that case, we have to tell iText that the current page shouldn't be treated as an empty page. You can do so by introducing the following line:
    writer.setPageEmpty(false);
Now the current page will be fooled into thinking that it has some content, even though it may be blank.
Adding this line to your code will avoid the "The document has no pages" exception and solve your problem of streams not being closed.
Take a look at the NewPage example if you want to experiment with the setPageEmpty() method.
You can add an empty page before closing the document, or catch the exception and ignore it.
In case you are still interested in a solution, or in case someone else is:
I had exactly the same issue and worked around it as follows:
I declare a boolean to track whether at least one page has been added, and check it before closing the document.
If no pages have been copied, I add a new page to the document with the AddPage method, passing a rectangle as parameter. I did not find a simpler way to add a page.
So the code should look like the following (possibly with some syntax errors, as I'm not familiar with C#):
    private void MergePDFs()
    {
        DataSourceSelectArguments args = new DataSourceSelectArguments();
        DataView view = (DataView)SourceCertCockpit.Select(args);
        System.Data.DataTable table = view.ToTable();
        List<PdfReader> readerList = new List<PdfReader>();
        iTextSharp.text.Document document = new iTextSharp.text.Document();
        PdfCopy copy = new PdfCopy(document, Response.OutputStream);
        document.Open();
        int index = 0;
        // Declared before the loop so it is still in scope when the document is closed.
        bool atLeastOnePage = false;
        foreach (DataRow myRow in table.Rows)
        {
            if (ListadoCertificadosCockpit.Rows[index].Cells[14].Text == "0")
            {
                PdfReader Reader = new PdfReader(Convert.ToString(myRow[0]));
                Chapter Chapter = new Chapter(Convert.ToString(Convert.ToInt32(myRow[1])), 0);
                Chapter.NumberDepth = 0;
                iTextSharp.text.Section Section = Chapter.AddSection(Convert.ToString(myRow[10]), 0);
                Section.NumberDepth = 0;
                iTextSharp.text.Section SubSection = Section.AddSection(Convert.ToString(myRow[7]), 0);
                SubSection.NumberDepth = 0;
                document.Add(Chapter);
                readerList.Add(Reader);
                for (int i = 1; i <= Reader.NumberOfPages; i++)
                {
                    copy.AddPage(copy.GetImportedPage(Reader, i));
                    atLeastOnePage = true;
                }
                Reader.Close();
            }
            index++;
        }
        if (!atLeastOnePage)
        {
            // Nothing was copied: add a placeholder page before closing,
            // then leave without sending the download headers.
            iTextSharp.text.Rectangle rec = new iTextSharp.text.Rectangle(10, 10, 10, 10);
            copy.AddPage(rec, 1);
            document.Close();
            return; // the method is void, so no value is returned
        }
        document.Close();
        string SalesID = SALESID.Text;
        Response.ContentType = "application/pdf";
        Response.Cache.SetCacheability(HttpCacheability.NoCache);
        Response.AppendHeader("content-disposition", "attachment;filename=" + SalesID + ".pdf");
    }

When I read a PDF doc using iTextSharp I only get the ? character

I'm trying to read the text in a PDF document using the iTextSharp library. For one particular document it returns only ? characters, while other documents work without any problem.
What is the reason for that?
Here is my code:
    private void readPDF()
    {
        string pdfTemplate = @"c:\test2.pdf";
        // Form title
        this.Text += " - " + pdfTemplate;
        String strText = "";
        try
        {
            PdfReader reader = new PdfReader(pdfTemplate);
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
                String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
                s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
                strText = strText + s;
            }
            reader.Close();
            textBox1.Text = strText;
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.Message);
        }
    }
Any ideas? Thanks.

Problem with PdfTextExtractor in itext!

First, excuse my bad English!
I want to search a PDF document for a word like "Hello", so I have to read each page with PdfTextExtractor. That part works: I can read all the words on each page separately and save them in a string buffer.
But when I put this code in a for loop (for example, to search pages 1 to 7), the words from earlier pages remain in the string buffer. I hope you understand my problem.
Thanks, all.
This is my code:
    PdfReader reader2 = new PdfReader(openFileDialog1.FileName);
    int pagen = reader2.NumberOfPages;
    reader2.Close();
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
    for (int i = 1; i < pagen; i++)
    {
        textBox1.Text = "";
        PdfReader reader = new PdfReader(openFileDialog1.FileName);
        String s = PdfTextExtractor.GetTextFromPage(reader, i, its);
        //MessageBox.Show(s.Length.ToString());
        //PdfTextArray h = new PdfTextArray(s);
        //
        // s = "";
        s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
        textBox1.Text = s;
        reader.Close();
    }
Unfortunately, SimpleTextExtractionStrategy cannot be reset, so you must move your "new SimpleTextExtractionStrategy()" inside the loop instead of reusing the same object.
There is another potential problem in the statement which controls your loop:
    for (int i = 1; i < pagen; i++)
With i < pagen the last page is always skipped, and if pagen = 1 the loop is not executed at all. It should read:
    for (int i = 1; i <= pagen; i++)
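The effect of both fixes can be illustrated without iText. The sketch below is an illustration under assumptions, not iText code: AccumulatingStrategy is a hypothetical stand-in that, like the strategy described above, keeps appending to one internal buffer, and the page texts are plain strings.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a strategy that cannot be reset:
// every call appends to the same internal buffer.
public sealed class AccumulatingStrategy
{
    private readonly StringBuilder buffer = new StringBuilder();

    public string Extract(string pageText)
    {
        buffer.Append(pageText);
        return buffer.ToString();
    }
}

public static class PerPageExtraction
{
    // One shared strategy: each result also contains the text of earlier pages.
    public static List<string> WithSharedStrategy(IList<string> pages)
    {
        var shared = new AccumulatingStrategy();
        var results = new List<string>();
        for (int i = 1; i <= pages.Count; i++) // <= so the last page is included
        {
            results.Add(shared.Extract(pages[i - 1]));
        }
        return results;
    }

    // A new strategy per page: each result contains only that page's text.
    public static List<string> WithFreshStrategy(IList<string> pages)
    {
        var results = new List<string>();
        for (int i = 1; i <= pages.Count; i++)
        {
            var fresh = new AccumulatingStrategy();
            results.Add(fresh.Extract(pages[i - 1]));
        }
        return results;
    }
}
```

For the pages "A", "B", "C", the shared variant ends with "ABC" as the third result, while the fresh variant returns just "C", which is the behaviour the question is after.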
    public string ReadPdfFile(object Filename, DataTable ReadLibray)
    {
        // Open the file once and reuse the same reader for every page.
        PdfReader reader = new PdfReader((string)Filename);
        string strText = string.Empty;
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // A fresh strategy per page, so text from earlier pages is not carried over.
            ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
            String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
            strText = strText + s;
        }
        reader.Close();
        return strText;
    }
This code is very helpful for reading a PDF using iText.
