Extract entire text from PDF with iTextSharp


I'm trying to parse PDF documents so that certain values can be added to an existing database. The problem is with parsing the PDF.
First try
String[] AllPdf = Directory.GetFiles(Directory.GetCurrentDirectory(), "*.pdf", SearchOption.TopDirectoryOnly);
foreach (var pdfDoc in AllPdf)
{
using (PdfReader reader = new PdfReader(pdfDoc))
{
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
String text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
}
}
}
Unfortunately, that only extracted the text that follows the titles (Employer, Website, Language, etc.), not the titles themselves. I need the titles in order to create a class that will be mapped to a relation in the database.
Second try
String[] AllPdf = Directory.GetFiles(Directory.GetCurrentDirectory(), "*.pdf", SearchOption.TopDirectoryOnly);
foreach (var pdfDoc in AllPdf)
{
using (PdfReader reader = new PdfReader(pdfDoc))
{
for (int page = 1; page <= reader.NumberOfPages; page++)
{
byte[] streamBytes = reader.GetPageContent(page);
PRTokeniser tokenizer = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().CreateSource(streamBytes)));
while (tokenizer.NextToken())
{
if (tokenizer.TokenType == PRTokeniser.TokType.STRING)
{
String text = tokenizer.StringValue;
}
}
}
}
}
This did extract the missing titles, but it returned all the titles first (with each word on its own line instead of on a single line) and the values afterwards.
iTextSharp documentation?
There must be classes in iTextSharp that can find the title/value pairs, or at least extract the titles in a readable format. I am happy to write my own implementation of ITextExtractionStrategy.

iTextSharp does not have an official documentation page, but you can find some answers here on SO. Instead of getting the data from the PDF as a String, try parsing it as XML and then use XPath to get the data you need. Alternatively, you can use LINQ to XML. I'm guessing that each page in the PDF has the same format, so the XML structure can have the same format as well.
Here is a project sample using iTextSharp, and here is an SDK (paid) that you can use; the free version is only a temporary solution.

Related

Parse PDF file to memory and perform search for certain value

I am rather new to C# and am trying to learn it in a more practical way to build interest and understanding. I have code that parses the PDF file at https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf, and it works well. However, I would like to write the result to memory instead of the console, so that I can search it for the invoice number later.
My current code for writing to the console:
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
namespace PDF_file_reader
{
class Program
{
static void Main(string[] args)
{
List<int> InvoiceNumbers = new List<int>();
string filePath = @"C:\temp\parser\Invoice_Template.pdf";
int pagesToScan = 2;
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
for (int page = 1; page <= pagesToScan; page++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
{
ITextExtractionStrategy its = new LocationTextExtractionStrategy();
strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
//creating the string array and storing the PDF line by line
string[] lines = strText.Split('\n');
foreach (string line in lines)
{
{
//Console.WriteLine($"<{line}>");
Console.WriteLine(line.ToString());
}
}
Console.Read();
}
}
catch (Exception ex)
{
Console.Write(ex);
}
}
}
}
Here is an output in console:
How can I write to the InvoiceNumbers list instead of the console, as I am doing now, and then search it? I guess a search would not be possible with my current setup?

Just a note: you have an extra set of { } in your foreach loop surrounding Console.WriteLine() that you can remove.
If you want to store the whole invoice number as it is highlighted in your screenshot ("INV-3337" instead of just "3337"), InvoiceNumbers needs to be a list of strings, not ints.
Assuming the invoice layout is always the same, or the number always has the same format (i.e. "Invoice Number INV-####"), you could just add a check in your foreach loop. Since each line is a string, you can check whether the line contains "Invoice Number". If it does, remove the phrase "Invoice Number", trim the result to get rid of any whitespace, and add it to InvoiceNumbers. Either above or below Console.WriteLine(line.ToString()); you would just add:
if (line.Contains("Invoice Number"))
InvoiceNumbers.Add(line.Replace("Invoice Number", "").Trim());
(I used Replace() instead of Remove() because Remove() requires you to know the start position and length of the phrase you want to remove. In my opinion, Replace() is the safest route for this particular situation.)
You can add break; to the if statement if that's all you're looking for as well. This will stop the foreach loop. Once you extract the invoice number, there is no reason to look through the rest of the document, unless you have multiple invoices in one document.
if (line.Contains("Invoice Number"))
{
InvoiceNumbers.Add(line.Replace("Invoice Number", "").Trim());
break;
}
If you want to search through the list for a particular invoice number, this answer should help with that.
This is assuming that the only difference would be the actual number. If it's not, you could always look into regular expressions and have it look for a pattern like "INV-\d*". That would also be assuming the invoice number format is always the same.
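The regular-expression route mentioned above could look roughly like this. It is a minimal sketch, assuming every invoice number is "INV-" followed by one or more digits; the class name InvoiceNumberFinder and the method FindInvoiceNumbers are hypothetical, and the string passed in stands for the text returned by PdfTextExtractor.GetTextFromPage.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
static class InvoiceNumberFinder
{
    // Assumed format: "INV-" followed by one or more digits.
    private static readonly Regex InvoicePattern = new Regex(@"INV-\d+");

    // Returns every invoice number found in the extracted page text,
    // in the order in which they appear.
    public static List<string> FindInvoiceNumbers(string pageText)
    {
        var numbers = new List<string>();
        foreach (Match match in InvoicePattern.Matches(pageText))
        {
            numbers.Add(match.Value);
        }
        return numbers;
    }
}
```

Calling InvoiceNumberFinder.FindInvoiceNumbers("Invoice Number INV-3337") returns a list containing "INV-3337". Once the numbers are in a List<string>, a later search is an ordinary call such as invoiceNumbers.Contains("INV-3337").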

Paragraph Reading in PDF

In my code, I need to read the content of a PDF file and, based on some specific requirements, insert that content into a SQL Server database.
I use iTextSharp to read the PDF. It reads well when the text runs across the entire line.
Problems arise when there is a table inside the PDF.
It first goes into column 1 and reads a line, then jumps into column 2 and reads a line, and so on.
The problem is that column 1 contains a paragraph and column 2 contains a paragraph. The reader breaks those paragraphs into separate single lines that have no meaning on their own.
I want it to go to column 1 and read the whole paragraph, and if it finds a new paragraph after a newline, read that paragraph as well.
Only after processing column 1 should it jump to column 2.
Currently I am using the code below:
PdfReader reader = new PdfReader(@"D:\pdf1.pdf");
int PageNum = reader.NumberOfPages;
StringBuilder text = new StringBuilder();
for (int i = 1; i <= PageNum; i++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default,
Encoding.UTF8,
Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
ReadContent(text.ToString());
text.Clear();
}

Search Text and highlight it

My code is in C#.
I am using Aspose to search for text and highlight it in a PDF.
It works, but it takes a very long time.
Example: my document has 25 pages and contains 25 instances of the search text, one on each page.
It takes 2 minutes, which is unacceptable.
I have 3 questions:
Is there a way to reduce the time taken?
Currently this approach is for PDF only, but I have all types of documents (xls, pdf, ppt, doc). Is there a way to perform this search and highlighting in all document types?
Is there a better way of doing this other than Aspose?
// open document
Document document = new Document(@"C:\TestArea\Destination\SUP000011\ATM-1B4L2KQ0ZE0-0001\OpenAML.pdf");
//create TextAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Martin");
//accept the absorber for all the pages
for (int i = 1; i <= document.Pages.Count; i++)
{
document.Pages[i].Accept(textFragmentAbsorber);
//get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
//loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
//update text and other properties
// textFragment.TextState.Invisible = false;
//textFragment.Text = "TEXT";
textFragment.TextState.Font = FontRepository.FindFont("Verdana");
textFragment.TextState.FontSize = 9;
textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Blue);
textFragment.TextState.BackgroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Yellow);
//textFragment.TextState.Underline = true;
}
}
// Save resulting PDF document.
document.Save(@"C:\TestArea\Destination\SUP000011\ATM-1B4L2KQ0ZE0-0001\Highlightdoc.pdf");

Extract words from a doc/docx file c#

I want to extract all the words from a Word file (doc/docx) and put them into a list. It seems that Microsoft.Office.Interop only works if I want to extract paragraphs and add them to a list.
List<string> data = new List<string>();
Microsoft.Office.Interop.Word.Application app = new
Microsoft.Office.Interop.Word.Application();
Document doc = app.Documents.Open(dlg.FileName);
foreach (Paragraph objParagraph in doc.Paragraphs)
data.Add(objParagraph.Range.Text.Trim());
((_Document)doc).Close();
((_Application)app).Quit();
I also found a way to extract the text word by word, but it doesn't work with a big document because the loop throws an exception.
Dictionary<int, string> motRap = new Dictionary<int, string>();
Microsoft.Office.Interop.Word.Application application = new Microsoft.Office.Interop.Word.Application();
Document document = application.Documents.Open("C:/Users/Titri/Desktop/test/test/bin/Debug/po.txt");
// Loop through all words in the document.
int count = document.Words.Count;
for (int i = 1; i <= count; i++)
{
string text = document.Words[i].Text;
motRap.Add(i, text);
}
// Close word.
application.Quit();
So my question is whether there is a way to extract words from a big Word file. I think Microsoft.Office.Interop is not a good tool for extracting from a big file.
Sorry, my English is not good.

The object inside a paragraph is called a Run, though I don't know whether it is available in Interop. To improve performance, I would suggest you switch to the Open XML SDK if you have to process a large number of documents.
If you want to stick with Interop, why don't you just split each paragraph into an array (with a space as the delimiter) and then add all the words to your list?
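The paragraph-splitting idea could be sketched like this. It is a minimal sketch that works on plain strings, assuming the paragraph texts have already been collected (for example from objParagraph.Range.Text as in the question); the class name WordSplitter and the method SplitIntoWords are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; it only needs the paragraph texts, not Interop itself.
static class WordSplitter
{
    // Split on spaces, tabs and line breaks (assumed to be the only separators).
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    // Flattens a sequence of paragraph texts into a single list of words.
    public static List<string> SplitIntoWords(IEnumerable<string> paragraphs)
    {
        var words = new List<string>();
        foreach (string paragraph in paragraphs)
        {
            // RemoveEmptyEntries drops the empty strings produced by repeated separators.
            words.AddRange(paragraph.Split(Separators, StringSplitOptions.RemoveEmptyEntries));
        }
        return words;
    }
}
```

For example, WordSplitter.SplitIntoWords(new[] { "Hello big  world\r", "Second paragraph" }) yields the five words Hello, big, world, Second and paragraph. Punctuation stays attached to the neighbouring word, so it would need to be trimmed separately if that matters.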

find page number of a string in pdf file in c#

I am developing a PDF reader. I want to find a given string in a PDF and get the corresponding page number. I am using iTextSharp.

Something like this should work:
// add any string you want to match on
Regex regex = new Regex("the",
RegexOptions.IgnoreCase | RegexOptions.Compiled
);
PdfReader reader = new PdfReader(pdfPath);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int i = 1; i <= reader.NumberOfPages; i++) {
ITextExtractionStrategy strategy = parser.ProcessContent(
i, new SimpleTextExtractionStrategy()
);
if ( regex.IsMatch(strategy.GetResultantText()) ) {
// do whatever with corresponding page number i...
}
}

As an alternative to iTextSharp, you can use Acrobat.dll to find the current page number. First, open the PDF file and search for the string using:
Acroavdoc.open("Filepath","Temporary title")
and
Acroavdoc.FindText("String").
If the string is found in the PDF file, the cursor moves to that page and the searched string is highlighted. Then use Acroavpageview.GetPageNum() to get the current page number.
Dim AcroXAVDoc As CAcroAVDoc
Dim Acroavpage As AcroAVPageView
Dim AcroXApp As CAcroApp
AcroXAVDoc = CType(CreateObject("AcroExch.AVDoc"), Acrobat.CAcroAVDoc)
AcroXApp = CType(CreateObject("AcroExch.App"), Acrobat.CAcroApp)
AcroXAVDoc.Open(TextBox1.Text, "Original document")
AcroXAVDoc.FindText("String to be searched", True, True, False)
Acroavpage = AcroXAVDoc.GetAVPageView()
Dim x As Integer = Acroavpage.GetPageNum
MsgBox("the string found in page number" & x)
