Problem with PdfTextExtractor in itext!

Problem with PdfTextExtractor in itext! - c#

first excuse me for my bad english!
I want to search in pdf document for a word like "Hello" . So I must read each page in pdf by PdfTextExtractor. I did it well. I can read all words in each page separately an save it in string buffer.
but when i push this code in For loop ,(for example from page 1 to 7 for search in it) earlier page's words will remain in string buffer.I hop you understand my problem.
Tanx all.
this is my code :
PdfReader reader2 = new PdfReader(openFileDialog1.FileName);
int pagen = reader2.NumberOfPages;
reader2.Close();
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
for (int i = 1; i < pagen; i++)
{
textBox1.Text = "";
PdfReader reader = new PdfReader(openFileDialog1.FileName);
String s = PdfTextExtractor.GetTextFromPage(reader, i, its);
//MessageBox.Show(s.Length.ToString());
//PdfTextArray h = new PdfTextArray(s);
//
// s = "";
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
textBox1.Text = s;
reader.Close();
}

SimpleTextExtractionStrategy doesn't let you reset it unfortunately, so you must move your "new SimpleTextExtractionStrategy()" inside the loop instead of reusing the same object.

There is another potential problem in the statement which controls your loop:
for (int i = 1; i < pagen; i++)
If pagen = 1, the loop is not executed at all. It should read:
for (int i = 1; i <= pagen; i++)

public string ReadPdfFile(object Filename,DataTable ReadLibray)
{
PdfReader reader2 = new PdfReader((string)Filename);
string strText = string.Empty;
for (int page = 1; page <= reader2.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
PdfReader reader = new PdfReader((string)Filename);
String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText = strText + s;
reader.Close();
}
return strText;
}
This Code is very HelpFull to read PDf using itext

Related

Read PDF Line By Line using iText7 and Fill on Textboxes Winforms

I am working on a WinForms application. I use the pdf file to reset the password and the values on pdf are stored as key-value pairs(email: xxxx#mail.com, pass: 11111).
What I want to do:
Read the PDF file line by line and fill the appropriate textboxes.
What I Have done:
public bool CreatePDF(string location, string email, string key)
{
if(location != "" && email != "" && key != "")
{
PdfWriter pdfwriter = new PdfWriter(location);
PdfDocument pdf = new PdfDocument(pdfwriter);
Document document = new Document(pdf);
Paragraph fields = new Paragraph("Email: "+email + "\n" + "Secret Key: "+key);
document.Add(fields);
document.Close();
return true;
}
else
{
return false;
}
}
public string ReadPDF(string location)
{
var pdfDocument = new PdfDocument(new PdfReader(location));
StringBuilder processed = new StringBuilder();
var strategy = new LocationTextExtractionStrategy();
string text = "";
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
var page = pdfDocument.GetPage(i);
text += PdfTextExtractor.GetTextFromPage(page, strategy);
processed.Append(text);
}
return text;
}
}
Thank you in advance Guys!. Any suggestions on CreatePDF are also welcome.

This is what I came up with,
var pdfDocument = new PdfDocument(new PdfReader("G:\\Encryption_File.pdf"));
StringBuilder processed = new StringBuilder();
var strategy = new LocationTextExtractionStrategy();
string text = "";
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
var page = pdfDocument.GetPage(i);
text += PdfTextExtractor.GetTextFromPage(page, strategy);
processed.Append(text);
}
text.Split('\n');
string line = "";
line = text + "&";
string[] newLines = line.Split('&');
textBox1.Text = newLines[0].Split(':')[1].ToString();
textBox2.Text = newLines[0].Split(':')[2].ToString();

How to Extract Text From a Landscape PDF File

I'm trying to extract the text from the landscape pdf file, I'm using iTextSharp, for Portrait pages, it works well but returns an empty string for Landscape pages.
here is my code
PdfReader reader = new PdfReader(pdfFile);
int intPageNum = reader.NumberOfPages;
var sb = new StringBuilder();
for (int i = 1; i <= intPageNum; i++) {
var text = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
sb.Append(text + "\n");
}

when I read a PDF doc using iTextSharp only get ? character

I'm trying to read text in a PDF doc using itextsharp library. I have a problem with a particular doc that only returns ? character. However with others doc I have not any problem.
¿What is the reason for that?
Here is my code
private void readPDF()
{
string pdfTemplate = #"c:\\test2.pdf";
// Título de formulario
this.Text += " - " + pdfTemplate;
String strText="";
try
{
PdfReader reader = new PdfReader(pdfTemplate);
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText = strText + s;
}
reader.Close();
textBox1.Text = strText;
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
Any ideas?? Thanks

Adding a new page using iTextSharp

I have a class that build the content for my table of contents and that works, fine and dandy.
Here's the code:
public void Build(string center,IDictionary<string,iTextSharp.text.Image> images)
{
iTextSharp.text.Image image = null;
XPathDocument rapportTekst = new XPathDocument(Application.StartupPath + #"..\..\..\RapportTemplates\RapportTekst.xml");
XPathNavigator nav = rapportTekst.CreateNavigator();
XPathNodeIterator iter;
iter = nav.Select("//dokument/brevhoved");
iter.MoveNext();
var logo = iTextSharp.text.Image.GetInstance(iter.Current.GetAttribute("url", ""));
iter = nav.Select("//dokument/gem_som");
iter.MoveNext();
string outputPath = iter.Current.GetAttribute("url", "")+center+".pdf";
iter = nav.Select("//dokument/titel");
iter.MoveNext();
this.titel = center;
Document document = new Document(PageSize.A4, 30, 30, 100, 30);
var outputStream = new FileStream(outputPath, FileMode.Create);
var pdfWriter = PdfWriter.GetInstance(document, outputStream);
pdfWriter.SetLinearPageMode();
var pageEventHandler = new PageEventHandler();
pageEventHandler.ImageHeader = logo;
pdfWriter.PageEvent = pageEventHandler;
DateTime timeOfReport = DateTime.Now.AddMonths(-1);
pageWidth = document.PageSize.Width - (document.LeftMargin + document.RightMargin);
pageHight = document.PageSize.Height - (document.TopMargin + document.BottomMargin);
document.Open();
var title = new Paragraph(titel, titleFont);
title.Alignment = Element.ALIGN_CENTER;
document.Add(title);
List<TableOfContentsEntry> _contentsTable = new List<TableOfContentsEntry>();
nav.MoveToRoot();
iter = nav.Select("//dokument/indhold/*");
Chapter chapter = null;
int chapterCount = 1;
while (iter.MoveNext())
{
_contentsTable.Add(new TableOfContentsEntry("Test", pdfWriter.CurrentPageNumber.ToString()));
XPathNodeIterator innerIter = iter.Current.SelectChildren(XPathNodeType.All);
chapter = new Chapter("test", chapterCount);
while(innerIter.MoveNext())
{
if (innerIter.Current.Name.ToString().ToLower().Equals("billede"))
{
image = images[innerIter.Current.GetAttribute("navn", "")];
image.Alignment = Image.ALIGN_CENTER;
image.ScaleToFit(pageWidth, pageHight);
chapter.Add(image);
}
if (innerIter.Current.Name.ToString().ToLower().Equals("sektion"))
{
string line = "";
var afsnit = new Paragraph();
line += (innerIter.Current.GetAttribute("id", "") + " ");
innerIter.Current.MoveToFirstChild();
line += innerIter.Current.Value;
afsnit.Add(line);
innerIter.Current.MoveToNext();
afsnit.Add(innerIter.Current.Value);
chapter.Add(afsnit);
}
}
chapterCount++;
document.Add(chapter);
}
document = CreateTableOfContents(document, pdfWriter, _contentsTable);
document.Close();
}
I'm then calling the method CreateTableOfContents(), and as such it is doing what it is supposed to do. Here's the code for the method:
public Document CreateTableOfContents(Document _doc, PdfWriter _pdfWriter, List<TableOfContentsEntry> _contentsTable)
{
_doc.NewPage();
_doc.Add(new Paragraph("Table of Contents", FontFactory.GetFont("Arial", 18, Font.BOLD)));
_doc.Add(new Chunk(Environment.NewLine));
PdfPTable _pdfContentsTable = new PdfPTable(2);
foreach (TableOfContentsEntry content in _contentsTable)
{
PdfPCell nameCell = new PdfPCell(_pdfContentsTable);
nameCell.Border = Rectangle.NO_BORDER;
nameCell.Padding = 6f;
nameCell.Phrase = new Phrase(content.Title);
_pdfContentsTable.AddCell(nameCell);
PdfPCell pageCell = new PdfPCell(_pdfContentsTable);
pageCell.Border = Rectangle.NO_BORDER;
pageCell.Padding = 6f;
pageCell.Phrase = new Phrase(content.Page);
_pdfContentsTable.AddCell(pageCell);
}
_doc.Add(_pdfContentsTable);
_doc.Add(new Chunk(Environment.NewLine));
/** Reorder pages so that TOC will will be the second page in the doc
* right after the title page**/
int toc = _pdfWriter.PageNumber - 1;
int total = _pdfWriter.ReorderPages(null);
int[] order = new int[total];
for (int i = 0; i < total; i++)
{
if (i == 0)
{
order[i] = 1;
}
else if (i == 1)
{
order[i] = toc;
}
else
{
order[i] = i;
}
}
_pdfWriter.ReorderPages(order);
return _doc;
}
The problem is however. I want to insert a page break before the table of contents, for the sake of reordering the pages, so that the table of contents is the first page, naturally. But the output of the pdf-file is not right.
Here's a picture of what it looks like:
It seems like the _doc.NewPage() in the CreateTableOfContents() method does not execute correctly. Meaning that the image and the table of contents is still on the same page when the method starts the reordering of pages.
EDIT: To clarify the above, the _doc.NewPage() gets executed, but the blank page is added after the picture and the table of contents.
I've read a couple of places that this could be because one is trying to insert a new page after an already blank page. But this is not the case.
I'll just link to the pdf files aswell, to better illustrate the problem.
The pdf with table of contents: with table of contents
The pdf without table of contents: without table of contents
Thank you in advance for your help :)

iTextSharp v5 GetTextFromPage() throws IndexOutOfRangeException

Trying to extract the textual content of a pdf with the following code:
PdfReader reader = new PdfReader(path);
string strText = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
string s = PdfTextExtractor.GetTextFromPage(reader, page);
strText += " " + s;
}
reader.Close();
NumberOfPages returns 257, but at page 227, GetTextFromPage() throws a IndexOutOfRangeException.
Any help is appreciated.
hofnarwillie

I resolved this issue by updating my version of iTextSharp from 5.1 to 5.2.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Problem with PdfTextExtractor in itext! - c#

SimpleTextExtractionStrategy doesn't let you reset it unfortunately, so you must move your "new SimpleTextExtractionStrategy()" inside the loop instead of reusing the same object.

There is another potential problem in the statement which controls your loop: for (int i = 1; i < pagen; i++) If pagen = 1, the loop is not executed at all. It should read: for (int i = 1; i <= pagen; i++)

Related

Read PDF Line By Line using iText7 and Fill on Textboxes Winforms

How to Extract Text From a Landscape PDF File

when I read a PDF doc using iTextSharp only get ? character

Adding a new page using iTextSharp

iTextSharp v5 GetTextFromPage() throws IndexOutOfRangeException

Categories

Resources

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Problem with PdfTextExtractor in itext! - c#

SimpleTextExtractionStrategy doesn't let you reset it unfortunately, so you must move your "new SimpleTextExtractionStrategy()" inside the loop instead of reusing the same object.

There is another potential problem in the statement which controls your loop: for (int i = 1; i < pagen; i++) If pagen = 1, the loop is not executed at all. It should read: for (int i = 1; i <= pagen; i++)

Related

Read PDF Line By Line using iText7 and Fill on Textboxes Winforms

How to Extract Text From a Landscape PDF File

when I read a PDF doc using iTextSharp only get *?* character

Adding a new page using iTextSharp

iTextSharp v5 GetTextFromPage() throws IndexOutOfRangeException

Categories

Resources

when I read a PDF doc using iTextSharp only get ? character