iTextSharp can't read some PDF files

iTextSharp can't read some PDF files - c#

I have a problem to read and display content of some PDFs into RichTextBox.
I use the following code:
string fileName = #"C:\Users\PC\Desktop\SomePdf.pdf";
string str = string.Empty;
PdfReader reader = new PdfReader(fileName);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
String s = PdfTextExtractor.GetTextFromPage(reader, i, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
str = str + s;
rtbVsebina.Text = str;
}
reader.Close();
Some PDFs can be read and displayed into RichTextBox and some they can not be. For those that can not be read I only get empty RichTextBox but with some added lines as I would press Key 'Enter' on the keyboard a couple of times.
Does anybody know what could be wrong?

You are confusing page content with page annotations.
Page content is part of the content stream of a page. It's referred to in the /Contents entry of the page dictionary and (optionally) in external objects (aka XObjects). With the code snippet you have copy/pasted in your question, you are extracting this content.
A rich text box is one of the many types of annotations. Annotations are not part of the content stream of a page. They are referred to from the /Annots entry of the page dictionary. If you want to get the contents of an annotation, you need to ask the page for its annotations instead of parsing the content of the page. See for instance Reading PDF Annotations with iText.
In answer to your question "What am I doing wrong": you were looking at the wrong place.

Related

How to access OpenXML content by page number?

Using OpenXML, can I read the document content by page number?
wordDocument.MainDocumentPart.Document.Body gives content of full document.
public void OpenWordprocessingDocumentReadonly()
{
string filepath = #"C:\...\test.docx";
// Open a WordprocessingDocument based on a filepath.
using (WordprocessingDocument wordDocument =
WordprocessingDocument.Open(filepath, false))
{
// Assign a reference to the existing document body.
Body body = wordDocument.MainDocumentPart.Document.Body;
int pageCount = 0;
if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
{
pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
}
for (int i = 1; i <= pageCount; i++)
{
//Read the content by page number
}
}
}
MSDN Reference
Update 1:
it looks like page breaks are set as below
<w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
<w:r>
<w:br w:type="page" />
</w:r>
</w:p>
So now I need to split the XML with above check and take InnerTex for each, that will give me page vise text.
Now question becomes how can I split the XML with above check?
Update 2:
Page breaks are set only when you have page breaks, but if text is floating from one page to other pages, then there is no page break XML element is set, so it revert back to same challenge how o identify the page separations.

You cannot reference OOXML content via page numbering at the OOXML data level alone.
Hard page breaks are not the problem; hard page breaks can be counted.
Soft page breaks are the problem. These are calculated according to
line break and pagination algorithms which are implementation
dependent; it is not intrinsic to the OOXML data. There is nothing
to count.
What about w:lastRenderedPageBreak, which is a record of the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak does not help in general either because:
By definition, w:lastRenderedPageBreak position is stale when content has
been changed since last opened by a program that paginates its
content.
In MS Word's implementation, w:lastRenderedPageBreak is known to be unreliable in various circumstances including
when table spans two pages
when next page starts with an empty paragraph
for
multi-column layouts with text boxes starting a new column
for
large images or long sequences of blank lines
If you're willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numberings, page counts, etc.
Otherwise, the only real answer is to move beyond page-based referencing frameworks that are dependent upon proprietary, implementation-specific pagination algorithms.

This is how I ended up doing it.
public void OpenWordprocessingDocumentReadonly()
{
string filepath = #"C:\...\test.docx";
// Open a WordprocessingDocument based on a filepath.
Dictionary<int, string> pageviseContent = new Dictionary<int, string>();
int pageCount = 0;
using (WordprocessingDocument wordDocument =
WordprocessingDocument.Open(filepath, false))
{
// Assign a reference to the existing document body.
Body body = wordDocument.MainDocumentPart.Document.Body;
if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
{
pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
}
int i = 1;
StringBuilder pageContentBuilder = new StringBuilder();
foreach (var element in body.ChildElements)
{
if (element.InnerXml.IndexOf("<w:br w:type=\"page\" />", StringComparison.OrdinalIgnoreCase) < 0)
{
pageContentBuilder.Append(element.InnerText);
}
else
{
pageviseContent.Add(i, pageContentBuilder.ToString());
i++;
pageContentBuilder = new StringBuilder();
}
if (body.LastChild == element && pageContentBuilder.Length > 0)
{
pageviseContent.Add(i, pageContentBuilder.ToString());
}
}
}
}
Downside: This wont work in all scenarios. This will work only when you have a page break, but if you have text extended from page 1 to page 2, there is no identifier to know you are in page two.

Unfortunately, As Why only some page numbers stored in XML of docx file? answers, docx dose not contains reliable page number service. Xml files carry no page number, until microsoft Word open it and render dynamically. Even you read openxml documents like https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1 .
You can unzip some docx files, and search "page" or "pg". Then you will know it. I do this on different kinds of docx files in my situation. All tell me the same truth. Glad if this helps.

List<Paragraph> Allparagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();
List<Paragraph> PageParagraphs = Allparagraphs.Where (x=>x.Descendants<LastRenderedPageBreak>().Count() ==1) .Select(x => x).Distinct().ToList();

Rename docx to zip.
Open docProps\app.xml file. :
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
<Template>Normal</Template>
<TotalTime>0</TotalTime>
<Pages>1</Pages>
<Words>141</Words>
<Characters>809</Characters>
<Application>Microsoft Office Word</Application>
<DocSecurity>0</DocSecurity>
<Lines>6</Lines>
<Paragraphs>1</Paragraphs>
<ScaleCrop>false</ScaleCrop>
<HeadingPairs>
<vt:vector size="2" baseType="variant">
<vt:variant>
<vt:lpstr>Название</vt:lpstr>
</vt:variant>
<vt:variant>
<vt:i4>1</vt:i4>
</vt:variant>
</vt:vector>
</HeadingPairs>
<TitlesOfParts>
<vt:vector size="1" baseType="lpstr">
<vt:lpstr/>
</vt:vector>
</TitlesOfParts>
<Company/>
<LinksUpToDate>false</LinksUpToDate>
<CharactersWithSpaces>949</CharactersWithSpaces>
<SharedDoc>false</SharedDoc>
<HyperlinksChanged>false</HyperlinksChanged>
<AppVersion>14.0000</AppVersion>
</Properties>
OpenXML lib reads wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text from <Pages>1</Pages> property . This properies are created only by winword application. if word document changed wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text is not actual. if word document created programmatically the wordDocument.ExtendedFilePropertiesPart is offten null.

Edit Total Number of Pages in Footer SelectPDF

I am converting html page to pdf using HtmlToPdf() of SelectPDF. Since html content is big, I am breaking it in half and creating 2 PDFs.
I am struggling to edit the total_pages in the footer to display actual total number of the pages, not only the current document; as well as page_number to display the actual page number in the context of both PDFs.
How can I assess {page_number} and {total_pages} to calculate proper values? All examples I found use PdfDocument(), not HtmlToPdf().
Dim converter As New HtmlToPdf()
Dim text As New PdfTextSection(0, 10, "Page: {page_number} of {total_pages} ")
text.HorizontalAlign = PdfTextHorizontalAlign.Center
converter.Footer.Add(text)
I am tagging both C# and VB since SelectPDF is for both languages, and relevant sample from either one will work for me. Thank you

Today I've stumbled upon the same issue and I have found a work-around for the problem. The converter was able to show page numbers for it's the generated document but can't be aware of multiple generated files (you can't access the page properties) so all my pages I concatenated were showing Page 1 of 1.
First I define one PdfDocument (see it as the main document) and I use HtmlToPdf to append html converted files to this main document.
// Create converter
converter = new HtmlToPdf();
PdfTextSection text = new PdfTextSection(0, 10, "Page: {page_number} of {total_pages} ", new Font("Arial", 8));
text.HorizontalAlign = PdfTextHorizontalAlign.Right;
converter.Footer.Add(text);
// Create main document
pdfDocument = new PdfDocument();
Then I add pages (from html) using this method
public void AddPage(string htmlPage)
{
PdfDocument doc = converter.ConvertHtmlString(htmlPage);
pdfDocument.Append(doc);
converter.Footer.TotalPagesOffset += doc.Pages.Count;
converter.Footer.FirstPageNumber += doc.Pages.Count;
}
This results in correct page numbers for the main document. The same trick could be used for splitting files and page numbers over multiple documents like you described.
EDIT: In case you don't see any page numbering using the HtmlToPdf converter, don't forget to set following property:
converter.Options.DisplayFooter = true;

There is an open source library called itextsharp that will help get total page count.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.xml;
namespace GetPages_PDF
{
class Program
{
static void Main(string[] args)
{
// Right side of equation is location of YOUR pdf file
string ppath = "C:\\aworking\\Hawkins.pdf";
PdfReader pdfReader = new PdfReader(ppath);
int numberOfPages = pdfReader.NumberOfPages;
Console.WriteLine(numberOfPages);
Console.ReadLine();
}
}
}
Then you can stamp text also on the page but you will need to add the location to where it needs to go.
link: http://crmhunt.com/how-to-modify-pdf-file-using-itextsharp/
hope this helps in some way.

You should use the following properties:
FirstPageNumber - Controls the page number for the first page being
rendered.
TotalPagesOffset - Controls the total number of pages
offset in the generated pdf document.
More details here:
http://selectpdf.com/html-to-pdf/docs/html/HtmlToPdfHeadersAndFooters.htm

The answers above did not work for me as I was trying to merge multiple PDFs with different orientations. bonnoj's answer did add page numbers but they were incorrect and I couldn't find a way to correct them. So I took a different approach - I created a PDF, then for each HTML page I added a pdfPage and then added a PdfHtmlElement to that page. Finally I loop over the pages and add a custom footer to each page. This may not be the most efficient way to do this but it's the only way that I could find that added the footer in the correct place when mixing portrait and landscape pages. Hopefully it will save somebody else spending hours playing with different properties.
var pdfDocument = new PdfDocument(PdfStandard.Full);
foreach (var (html, pdfPageOrientation) in pages)
{
var page = pdfDocument.AddPage(PdfCustomPageSize.A4, new PdfMargins(marginLeft, marginRight, marginTop, marginBottom));
page.Orientation = pdfPageOrientation;
var pdfHtmlElement = new PdfHtmlElement(html, "");
page.Add(pdfHtmlElement);
}
var pdfFont = pdfDocument.AddFont(PdfStandardFont.Helvetica);
pdfFont.Size = 12;
foreach (PdfPage page in pdfDocument.Pages)
{
var customFooter = pdfDocument.AddTemplate(page.PageSize.Width, 30);
var pdfFooterTextElement = new PdfTextElement(0, 15,
pageFooterText,
pdfFont)
{
HorizontalAlign = PdfTextHorizontalAlign.Right,
VerticalAlign = PdfTextVerticalAlign.Bottom,
};
customFooter.Add(pdfFooterTextElement);
page.CustomFooter = customFooter;
}
pdfDocument.Save(stream);

Generate PDF from large html

I am generating a PDF from an HTML string.
When this string is really long, I would like to create a new page, split the text (without breaking the html) and so on.
Here is my code :
// instantiate Pdf object
Aspose.Pdf.Generator.Pdf pdf = new Aspose.Pdf.Generator.Pdf();
// specify the Character encoding for for HTML file
pdf.HtmlInfo.CharSet = "UTF-8";
pdf.HtmlInfo.Margin.Left = 10;
pdf.HtmlInfo.Margin.Right = 10;
pdf.HtmlInfo.PageHeight = 1050;
pdf.HtmlInfo.PageWidth = 730;
pdf.HtmlInfo.ShowUnknownHtmlTagsAsText = true;
pdf.HtmlInfo.TryEnlargePredefinedTableColumnWidthsToAvoidWordBreaking = true;
pdf.HtmlInfo.CharsetApplyingLevelOfForce = Aspose.Pdf.Generator.HtmlInfo.CharsetApplyingForceLevel.UseWhenImpossibleDetectFromContent;
// bind the source HTML
pdf.BindHTML("MyVeryVeryLongHTML");
MemoryStream stream = new MemoryStream();
pdf.Save(stream);
byte[] pdfBytes = stream.ToArray();
This code works for the HTML, but the overflow is not handled. The text continue after the page. Is it possible to set a max "height" of the page to not cross, and if it does, it recreates a new page ?
Hope it makes sense !
Thanks a lot

You can set the Page height by selecting type of PDF page you require like A1, A2, etc . Afterwords , your problem of page height will automatically be taken care by the Aspose. For more refer the link..
Aspose PDF Page Height
Update
update pdf.HtmlInfo to pdf.PageSetup (or pdf.PageInfo) and add bottom margin also.

How to extract text from a PDF and decode characters?

I am using itextsharp to extract text from a pdf document using this code:
public static bool does_document_text_have_keyword(string keyword,
string pdf_src, Report report_object) // TEST
{
try
{
PdfReader pdfReader = new PdfReader(pdf_src);
string currentText;
int count = pdfReader.NumberOfPages;
for (int page = 1; page <= count; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
currentText = PdfTextExtractor.GetTextFromPage
(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString
(ASCIIEncoding.Convert
(Encoding.Default,
Encoding.UTF8,
Encoding.Default.GetBytes(currentText)));
report_object.log(currentText); // TEST
if (currentText.IndexOf
(keyword, StringComparison.OrdinalIgnoreCase) != -1) return true;
}
pdfReader.Close();
return false;
}
catch
{
return false;
}
}
But the problem is, when I extract text, the text has no white spaces, it's as if the white spaces has been replaced with an empty string. Yet in the pdf document, there are white spaces in it. Does anyone know whats happening here?

I believe your issue is the SimpleTextExtractionStrategy. From the API documentation at http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/SimpleTextExtractionStrategy.html
If the PDF renders text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be inserted into the output.
Try using the LocationTextExtractionStrategy. It's documentation states:
A text extraction renderer that keeps track of relative position of text on page The resultant text will be relatively consistent with the physical layout that most PDF files have on screen.

Reading pdf content using iTextSharp in C#

I use this code to read pdf content using iTextSharp. it works fine when content is english but it doesn't work whene content is Persian or Arabic Result is something like this :
Here is sample non-English PDF for test.
ÙŽÙ›Ù†Ø§ ÙÙ”Ø¨Ù˜Ø·Ø« ÛŒØ¿ÛŒÙ›Ù˜ Ø²Ø¾Ø§ ÙÙ›ÙØÙ” Ù‚Ù›Ù…Ø
ÛŒÙ”Ø¨Ù•Ø³ Â© Karl Seguin foppersian.codeplex.com
www.codebetter.com 1 1 ÙÙ”Ø¨Ù˜Ø·Ø« ÙŽÙ›Ù†Ø§ ÛŒØ¿ÛŒÙ›Ù˜
Ù‡Ù…Ø§Ù†Ø±Ø¨ Ù„ÙˆØµØ§ ÛŒØ³ÛŒÙˆÙ† Ù…Ø±Ù† Ø¯ÛŒÙ„ÙˆØª Ø±ØªÙ‡Ø¨ Ø±Ø§Ø²ÙØ§
What is the solution ?
public string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
text.Append(currentText);
pdfReader.Close();
}
}
return text.ToString();
}

In .Net, once you have a string, you have a string, and it is Unicode, always. The actual in-memory implementation is UTF-16 but that doesn't matter. Never, ever, ever decompose the string into bytes and try to reinterpret it as a different encoding and slap it back as a string because that doesn't make sense and will almost always fail.
Your problem is this line:
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
I'm going to pull it apart into a couple of lines to illustrate:
byte[] bytes = Encoding.UTF8.GetBytes("ی"); //bytes now holds 0xDB8C
byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes);//converted now holds 0xC39BC592
string final = Encoding.UTF8.GetString(converted);//final now holds ÛŒ
The code will mix up anything above the 127 ASCII barrier. Drop the re-encoding line and you should be good.
Side-note, it is totally possible that whatever creates a string does it incorrectly, that's not too uncommon actually. But you need to fix that problem before it becomes a string, at the byte level.
EDIT
The code should be the exact same as yours above except that one line should be removed. Also, whatever you're using to display the text in, make sure that it supports Unicode. Also, as #kuujinbo said, make sure that you're using a recent version of iTextSharp. I tested this with 5.2.0.0.
public string ReadPdfFile(string fileName) {
StringBuilder text = new StringBuilder();
if (File.Exists(fileName)) {
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}
EDIT 2
The above code fixes the encoding issue but doesn't fix the order of the strings themselves. Unfortunately this problem appears to be at the PDF level itself.
Consequently, showing text in such right-to-left writing systems
requires either positioning each glyph individually (which is tedious
and costly) or representing text with show strings (see 9.2,
“Organization and Use of Fonts”) whose character codes are given in
reverse order.
PDF 2008 Spec - 14.8.2.3.3 - Reverse-Order Show Strings
When re-ordering strings such as above, content is (if I understand the spec correctly) supposed to use a "marked content" section, BMC. However, the few sample PDFs that I've looked at and generated don't appear to actually do this. I absolutely could be wrong on this part because this is very much not my specialty so you'll have to poke around so more.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

iTextSharp can't read some PDF files - c#

Related

How to access OpenXML content by page number?

Edit Total Number of Pages in Footer SelectPDF

Generate PDF from large html

How to extract text from a PDF and decode characters?

Reading pdf content using iTextSharp in C#

Categories

Resources