How to Extract Text From a Landscape PDF File

How to Extract Text From a Landscape PDF File - c#

I'm trying to extract the text from the landscape pdf file, I'm using iTextSharp, for Portrait pages, it works well but returns an empty string for Landscape pages.
here is my code
PdfReader reader = new PdfReader(pdfFile);
int intPageNum = reader.NumberOfPages;
var sb = new StringBuilder();
for (int i = 1; i <= intPageNum; i++) {
var text = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
sb.Append(text + "\n");
}

Related

Read PDF Line By Line using iText7 and Fill on Textboxes Winforms

I am working on a WinForms application. I use the pdf file to reset the password and the values on pdf are stored as key-value pairs(email: xxxx#mail.com, pass: 11111).
What I want to do:
Read the PDF file line by line and fill the appropriate textboxes.
What I Have done:
public bool CreatePDF(string location, string email, string key)
{
if(location != "" && email != "" && key != "")
{
PdfWriter pdfwriter = new PdfWriter(location);
PdfDocument pdf = new PdfDocument(pdfwriter);
Document document = new Document(pdf);
Paragraph fields = new Paragraph("Email: "+email + "\n" + "Secret Key: "+key);
document.Add(fields);
document.Close();
return true;
}
else
{
return false;
}
}
public string ReadPDF(string location)
{
var pdfDocument = new PdfDocument(new PdfReader(location));
StringBuilder processed = new StringBuilder();
var strategy = new LocationTextExtractionStrategy();
string text = "";
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
var page = pdfDocument.GetPage(i);
text += PdfTextExtractor.GetTextFromPage(page, strategy);
processed.Append(text);
}
return text;
}
}
Thank you in advance Guys!. Any suggestions on CreatePDF are also welcome.

This is what I came up with,
var pdfDocument = new PdfDocument(new PdfReader("G:\\Encryption_File.pdf"));
StringBuilder processed = new StringBuilder();
var strategy = new LocationTextExtractionStrategy();
string text = "";
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
var page = pdfDocument.GetPage(i);
text += PdfTextExtractor.GetTextFromPage(page, strategy);
processed.Append(text);
}
text.Split('\n');
string line = "";
line = text + "&";
string[] newLines = line.Split('&');
textBox1.Text = newLines[0].Split(':')[1].ToString();
textBox2.Text = newLines[0].Split(':')[2].ToString();

Hebrew characters are printed in reverse order in iText 7

I'm using iText 7 for printing some text in List (in Hebrew) to PDF file, but while printing the Hebrew characters are printed in a reverse way.
The input given was "קניון אם הדרך" while the data printed in PDF is "ךרדה םא ןוינק"
I have set the font encoding and the base direction for the elements but still the characters are printed in the reverse order.
Code to print list into the table in PDF:
public void PrintListData(string filename)
{
var writer = new PdfWriter(filename);
PdfDocument pdfDoc = new PdfDocument(writer);
Document doc = new Document(pdfDoc, PageSize.A4);
SetFont(doc);
Table table = new Table(UnitValue.CreatePercentArray(2)).UseAllAvailableWidth();
table.SetTextAlignment(TextAlignment.RIGHT);
//table.SetBaseDirection(BaseDirection.RIGHT_TO_LEFT);
var index = 1;
foreach (var item in _data)
{
var para = new Paragraph(item);
para.SetTextAlignment(TextAlignment.RIGHT);
//para.SetBaseDirection(BaseDirection.RIGHT_TO_LEFT);
table.AddCell(index.ToString());
table.AddCell(para);
index++;
}
doc.Add(table);
doc.Close();
Process.Start(filename);
}
Set font function
void SetFont(Document doc)
{
FontSet _defaultFontSet = new FontSet();
var fontFolders = Environment.GetFolderPath(Environment.SpecialFolder.Fonts);
_defaultFontSet.AddFont(System.IO.Path.Combine(fontFolders, "arial.ttf"), PdfEncodings.IDENTITY_H);
_defaultFontSet.AddFont(System.IO.Path.Combine(fontFolders, "arialbd.ttf"), PdfEncodings.IDENTITY_H);
doc.SetFontProvider(new FontProvider(_defaultFontSet));
doc.SetProperty(Property.FONT, new String[] { "MyFontFamilyName" });
//doc.SetProperty(Property.BASE_DIRECTION, BaseDirection.RIGHT_TO_LEFT);
}
I'm not sure to where should we set the text direction to print RTL.

Search Multiple PDFs in a directory for a string in C# using itext7

I am trying to search for text in each PDF inside of a directory using itext7. I can figure out how to search just one PDF.
I managed to search one pdf using the below code, how can I make this work for each PDF in a directory?
public List<int> ReadPdfFile(string fileName, String searchString)
{
List<int> pages = new List<int>();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
if (currentPageText.Contains(searchString))
{
MessageBox.Show("Found COLLIN GRADY");
}
else
{
MessageBox.Show("Could not find COLLIN GRADY");
}
}
pdfReader.Close();
}
return pages;
}
This works, by calling
ReadPdfFile("C:\\Users\\Billy\\Desktop\\All custom flyers\\ALBANY Ketchup Nov 2018 2.pdf", "COLLIN GRADY");

How to Add bookmarks to PDF file?

i have 4 pdf templates files by using itextsharp i added values and i merged/added 4 pdf files into single document, so all 4 pages are under one single pdf file name.Now i want to add bookmark to my pdf file. is there any way to do in C# ?for better understanding ,please refer below images
Hi ,this is what i am trying to do, i am not getting any error but still there is no bookmark in my pdf, i want to add bookmark with 4 sections as showed in the image.after merging i want add bookmark to final pdf.
enter code herepublic string MergePDFs()
{
string outPutFilePath = #"D:\jeldsbre.pdf";
string genereatedpdfs = #"D:\genereatedpdfs";
using (FileStream stream = new FileStream(outPutFilePath, FileMode.Create))
{
Document pdfDoc = new Document(PageSize.A4);
PdfCopy pdf = new PdfCopy(pdfDoc, stream);
pdf.SetMergeFields();
pdfDoc.Open();
var files = Directory.GetFiles(genereatedpdfs);
Console.WriteLine("Merging files count: " + files.Length);
int i = 1;
foreach (string file in files)
{
Console.WriteLine(i + ". Adding: " + file);
pdf.AddDocument(new PdfReader(file));
i++;
}
List<Dictionary<string, object>> bookmarks = new List<Dictionary<string, object>>();
IList<Dictionary<string, object>> tempBookmarks = new List<Dictionary<string, object>>();
SimpleBookmark.ShiftPageNumbers(tempBookmarks, 1, null);
bookmarks.AddRange(tempBookmarks);
SimpleBookmark.ShiftPageNumbers(tempBookmarks, 3, null);
bookmarks.AddRange(tempBookmarks);
pdf.Outlines = bookmarks;
if (pdfDoc != null)
pdfDoc.Close();
string base64 = GetBase64(outPutFilePath);
return base64;
}
}

Assuming that your original PDFs already have bookmarks, then you should concatenate not only the documents (using the PdfCopy class), you should also concatenate the different bookmarks structures of the different files (using the SimpleBookMark class), not forgetting to take into account that you need to shift the page numbers correctly.
This is done in the ConcatenateBookmarks example in chapter 7 of my book:
// Create a list for the bookmarks
ArrayList<HashMap<String, Object>> bookmarks = new ArrayList<HashMap<String, Object>>();
List<HashMap<String, Object>> tmp;
for (int i = 0; i < src.length; i++) {
reader = new PdfReader(src[i]);
// merge the bookmarks
tmp = SimpleBookmark.getBookmark(reader);
SimpleBookmark.shiftPageNumbers(tmp, page_offset, null);
bookmarks.addAll(tmp);
// add the pages
n = reader.getNumberOfPages();
page_offset += n;
for (int page = 0; page < n; ) {
copy.addPage(copy.getImportedPage(reader, ++page));
}
copy.freeReader(reader);
reader.close();
}
// Add the merged bookmarks
copy.setOutlines(bookmarks);
For a C# version of this example, please take a look at http://tinyurl.com/itextsharpIIA2C07 for the corresponding iTextSharp example:
// Create a list for the bookmarks
List<Dictionary<String, Object>> bookmarks =
new List<Dictionary<String, Object>>();
for (int i = 0; i < src.Count; i++) {
PdfReader reader = new PdfReader(src[i]);
// merge the bookmarks
IList<Dictionary<String, Object>> tmp =
SimpleBookmark.GetBookmark(reader);
SimpleBookmark.ShiftPageNumbers(tmp, page_offset, null);
foreach (var d in tmp) bookmarks.Add(d);
// add the pages
int n = reader.NumberOfPages;
page_offset += n;
for (int page = 0; page < n; ) {
copy.AddPage(copy.GetImportedPage(reader, ++page));
}
}
// Add the merged bookmarks
copy.Outlines = bookmarks;
If the existing documents don't have any bookmarks (or if you don't want to copy any existing documents), then your question is a duplicate of a question I answered half a year ago: Merge pdfs and add bookmark with iText in java

Problem with PdfTextExtractor in itext!

first excuse me for my bad english!
I want to search in pdf document for a word like "Hello" . So I must read each page in pdf by PdfTextExtractor. I did it well. I can read all words in each page separately an save it in string buffer.
but when i push this code in For loop ,(for example from page 1 to 7 for search in it) earlier page's words will remain in string buffer.I hop you understand my problem.
Tanx all.
this is my code :
PdfReader reader2 = new PdfReader(openFileDialog1.FileName);
int pagen = reader2.NumberOfPages;
reader2.Close();
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
for (int i = 1; i < pagen; i++)
{
textBox1.Text = "";
PdfReader reader = new PdfReader(openFileDialog1.FileName);
String s = PdfTextExtractor.GetTextFromPage(reader, i, its);
//MessageBox.Show(s.Length.ToString());
//PdfTextArray h = new PdfTextArray(s);
//
// s = "";
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
textBox1.Text = s;
reader.Close();
}

SimpleTextExtractionStrategy doesn't let you reset it unfortunately, so you must move your "new SimpleTextExtractionStrategy()" inside the loop instead of reusing the same object.

There is another potential problem in the statement which controls your loop:
for (int i = 1; i < pagen; i++)
If pagen = 1, the loop is not executed at all. It should read:
for (int i = 1; i <= pagen; i++)

public string ReadPdfFile(object Filename,DataTable ReadLibray)
{
PdfReader reader2 = new PdfReader((string)Filename);
string strText = string.Empty;
for (int page = 1; page <= reader2.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
PdfReader reader = new PdfReader((string)Filename);
String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText = strText + s;
reader.Close();
}
return strText;
}
This Code is very HelpFull to read PDf using itext

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to Extract Text From a Landscape PDF File - c#

Related

Read PDF Line By Line using iText7 and Fill on Textboxes Winforms

Hebrew characters are printed in reverse order in iText 7

Search Multiple PDFs in a directory for a string in C# using itext7

How to Add bookmarks to PDF file?

Problem with PdfTextExtractor in itext!

Categories

Resources