Paragraph Reading in PDF

Paragraph Reading in PDF - c#

In my code, I need to read the PDF file content and based on some specific requirement I need to insert the content of PDF into SQL server DB.
I used iTextsharp for PDF reading. It reads well when it found the entire line in PDF.
Problems come when they found a table inside the PDF.
It first gets into column1 and reads the line and jumps into column2 and reads that line and so on.
Problem is column1 has paragraph string and column2 has paragraph string. It breaks those paragraph into single different lines which have no meaning.
I want it to work like go to column1 read paragraph and if it find new paragraph after newline then read the paragraph from second line.
After processing column1 then jumps into colum2.
Currently I am using below code:
PdfReader reader = new PdfReader(#"D:\pdf1.pdf");
int PageNum = reader.NumberOfPages;
StringBuilder text = new StringBuilder();
for (int i = 1; i <= PageNum; i++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default,
Encoding.UTF8,
Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
ReadContent(text.ToString());
text.Clear();
}

Related

Word found unreadable content in docm

I am writing a program in C# using Open XML that transfers data from excel to word.
Currently, I have this:
internal override void UpdateSectionSheets(int sectionNum, List<List<string>> tableContents)
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(MainForm.WordFileDialog.FileName, true))
{
List<Table> tables = doc.MainDocumentPart.Document.Descendants<Table>().ToList();
foreach(Table table in tables)
{
int row = 1;
if (table.Descendants<TableRow>().FirstOrDefault().Descendants<TableCell>().FirstOrDefault().InnerText == sectionNum.ToString())
{
foreach(var item in tableContents[0])
{
// splits the tableContents[0][row - 1] into individual strings at each instance of "\n\n"
String str = tableContents[0][row - 1];
String[] separator = {"\n\n"};
Int32 count = 6; // max 6 sub strings (usually only two but allowed for extra)
String[] subStrs = str.Split(separator, count, StringSplitOptions.RemoveEmptyEntries);
// transfer comment
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).RemoveAllChildren<Paragraph>(); // removes the existing contents in the cell
foreach (string s in subStrs)
{
// for every substring, create a new paragraph and append the substring to that new paragraph. Makes it so that each sentence is on its own line
Text text = new Text(s);
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).AppendChild(new Paragraph(new Run(text)));
}
// transfer verdict
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(3).RemoveAllChildren<Paragraph>();
Paragraph p = new Paragraph(new ParagraphProperties(new Justification() { Val = JustificationValues.Center }));
p.Append(new Run(new Text(tableContents[1][row - 1])));
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(3).AppendChild(p);
row++;
}
}
}
doc.Save();
}
}
I believe the line causing the issue is: table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).AppendChild(new Paragraph(new Run(text)));
If I put new Text(tableContents[0][row - 1]) in place of (text) in the above line, the program will run and word doc will open with no errors, but the output is not in the format I need.
The program runs without throwing any errors, but when I try to open the word doc it gives a "word found unreadable content in xxx.docm" error. If I say I trust the source and want word to recover the document, I can open the doc and see that the code is working how I want. However, I don't want to have to do that every time. Does anyone know what is causing the error and how I can fix it?

Search Text and highlight it

My code is in C#
I am using Aspose to search text and highlight it in pdf.
It is working but the time taken is very huge.
Example : My document has 25 pages and it has 25 instance of search text , 1 search text in each page.
It take 2 minutes which is unacceptable.
I have 3 questions:
Is it a way to reduce this time taken ?
Currently this approach is for pdf, in my case i have all types of doc (xls, pdf, ppt, doc)? Is there any way where this search and highlighting can be performed in all docs ?
Is there some better way of doing it other than aspose ?
// open document
Document document = new Document(#"C:\TestArea\Destination\SUP000011\ATM-1B4L2KQ0ZE0-0001\OpenAML.pdf");
//create TextAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Martin");
//accept the absorber for all the pages
for (int i = 1; i <= document.Pages.Count; i++)
{
document.Pages[i].Accept(textFragmentAbsorber);
//get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
//loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
//update text and other properties
// textFragment.TextState.Invisible = false;
//textFragment.Text = "TEXT";
textFragment.TextState.Font = FontRepository.FindFont("Verdana");
textFragment.TextState.FontSize = 9;
textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Blue);
textFragment.TextState.BackgroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Yellow);
//textFragment.TextState.Underline = true;
}
}
// Save resulting PDF document.
document.Save(#"C:\TestArea\Destination\SUP000011\ATM-1B4L2KQ0ZE0-0001\Highlightdoc.pdf");

iText C# exact text displaying

I am trying to insert the exact text with the spaces at the beginning of line, however iText eats all the spaces before the first visible symbol (tabulation does't work as well).
I am using iText 7 Community edition.
C# code:
FileInfo file = new FileInfo(DEST);
file.Directory.Create();
//Initialize PDF writer
PdfWriter writer = new PdfWriter(DEST);
//Initialize PDF document
PdfDocument pdf = new PdfDocument(writer);
// Initialize document
Document doc = new Document(pdf);
doc.Add(new Paragraph("Test\n\tTest\n Test\n Test 1 2 3"));
doc.Close();
That code display the text in the output .pdf document as
Test
Test
Test
Test 1 2 3
Without any tabs and spaces before the fist visible symbol of each line.
How can I change code to get
Test
Test
Test
Test 1 2 3
in the output document?

In your code example, (embedded) tabs wouldn't work in iTextSharp 5.xx.xx either, although spaces are respected. What's a little surprising, as you've proved, is that iText7 strips spaces following a newline. Not sure if you need support for either or both, so will give an example that handles each case separately:
First, preserving tabs:
Paragraph p = new Paragraph("Line 0\n")
.AddTabStops(new TabStop(8f))
// change to your needs ^^
.Add(new Tab())
.Add("Line 1");
doc.Add(p);
Second, preserving spaces immediately following a newline:
string[] lines = "0\n1\n 2\n 3\n".Split(
new string[] { "\n" },
StringSplitOptions.RemoveEmptyEntries
);
p = new Paragraph().AddStyle(
new Style().SetFont(PdfFontFactory.CreateFont(FontConstants.COURIER))
);
foreach (var l in lines)
{
if (Regex.IsMatch(l, #"^\s+"))
{
p.Add(" ") // all spaces stripped, whether one or more characters
.Add(l) // now leading whitespace preserved
.Add("\n");
}
else
{
p.Add(l).Add("\n");
}
}
doc.Add(p);
This is the first time I've looked at/written any iText7, so there's likely a different/better way, and I don't consider it anything but a workaround. Oddly, if you add any number of space characters following a newline and then immediately add a string that's also preceded with space characters, the first call strips space, but the second preserves them.
As a side note, one thing I noticed right away and really like about the new API is you can use method chaining all over the place. :)
Here's the result:

You should use Chunks to add text to a paragraph.
Then you should set tab settings and use specific Chunk.TABBING
p = new Paragraph();
p.setTabSettings(new TabSettings(56f));
p.add(Chunk.TABBING);
p.add(new Chunk("Hello World with tab."));
This sample is located at iText examples

Just try this.
Font bodyFont = FontFactory.GetFont("Times New Roman", 10, Font.NORMAL);
file.Directory.Create();
//Initialize PDF writer
PdfWriter writer = new PdfWriter(DEST);
//Initialize PDF document
PdfDocument pdf = new PdfDocument(writer);
// Initialize document
Document doc = new Document(pdf);
doc.Add(new Paragraph("Test", bodyFont));
doc.Add(new Paragraph(" Test", bodyFont));
doc.Add(new Paragraph(" Test", bodyFont));
doc.Add(new Paragraph(" Test 1 2 2", bodyFont));
doc.Close();

Extract entire text from PDF with iTextSharp

I'm trying to parse PDF documents in order for certain values be added to an existing database. The problem is with parsing the PDF.
First try
String[] AllPdf = Directory.GetFiles(Directory.GetCurrentDirectory(), "*.pdf", SearchOption.TopDirectoryOnly);
foreach (var pdfDoc in AllPdf)
{
using (PdfReader reader = new PdfReader(pdfDoc))
{
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
String text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
}
}
}
But unfortunately that only parsed the text after the titles (Employer, Website, Language etc). And I need the titles in order to create a class which will be mapped to a relation in the database.
Second try
String[] AllPdf = Directory.GetFiles(Directory.GetCurrentDirectory(), "*.pdf", SearchOption.TopDirectoryOnly);
foreach (var pdfDoc in AllPdf)
{
using (PdfReader reader = new PdfReader(pdfDoc))
{
for (int page = 1; page <= reader.NumberOfPages; page++)
{
byte[] streamBytes = reader.GetPageContent(page);
PRTokeniser tokenizer = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().CreateSource(streamBytes)));
while (tokenizer.NextToken())
{
if (tokenizer.TokenType == PRTokeniser.TokType.STRING)
{
String text = tokenizer.StringValue;
}
}
}
}
}
Fortunately, this parsed the missing titles, but it parsed them first (words in new lines instead of single line) and the value afterwards.
iTextSharp documentation?
There must be classes in iTextSharp which can find the titles/values pair. Or at least parse the titles in readable format. I am happy to write my own implementation of ITextExtractionStrategy.

iTextSharp does not have an official documentation page, but you can find some answers here on SO. Instead of getting the data from the PDF in a String, try parsing it as XML and then use XPath to get the data you need. Or you can use Linq to XML. I'm guessing that each page in the PDF has the same format, so the XML structure can have the same format as well.
Here is a project sample using iTextSharp and here is a SDK (paid) taht you can use, but if you want it free it's a temporary solution.

find page number of a string in pdf file in c#

I am developing a pdf reader. i want to find any string in pdf and to know the corresponding page number. I am using iTextSharp.

Something like this should work:
// add any string you want to match on
Regex regex = new Regex("the",
RegexOptions.IgnoreCase | RegexOptions.Compiled
);
PdfReader reader = new PdfReader(pdfPath);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int i = 1; i <= reader.NumberOfPages; i++) {
ITextExtractionStrategy strategy = parser.ProcessContent(
i, new SimpleTextExtractionStrategy()
);
if ( regex.IsMatch(strategy.GetResultantText()) ) {
// do whatever with corresponding page number i...
}
}

In order to use Itextsharp you can use Acrobat.dll to find the current page number. First of all open the pdf file and search the string usingL
Acroavdoc.open("Filepath","Temperory title")
and
Acroavdoc.FindText("String").
If the string found in this pdf file then the cursor moved into the particular page and the searched string will be highlighted. Now we use Acroavpageview.GetPageNum() to get the current page number.
Dim AcroXAVDoc As CAcroAVDoc
Dim Acroavpage As AcroAVPageView
Dim AcroXApp As CAcroApp
AcroXAVDoc = CType(CreateObject("AcroExch.AVDoc"), Acrobat.CAcroAVDoc)
AcroXApp = CType(CreateObject("AcroExch.App"), Acrobat.CAcroApp)
AcroXAVDoc.Open(TextBox1.Text, "Original document")
AcroXAVDoc.FindText("String is to searched", True, True, False)
Acroavpage = AcroXAVDoc.GetAVPageView()
Dim x As Integer = Acroavpage.GetPageNum
MsgBox("the string found in page number" & x)

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Paragraph Reading in PDF - c#

Related

Word found unreadable content in docm

Search Text and highlight it

iText C# exact text displaying

Extract entire text from PDF with iTextSharp

find page number of a string in pdf file in c#

Categories

Resources