PDFBox 0.7.3 convert pdf to text

I want to convert PDF files to text files, but some PDF files do not work with the PDFBox DLL because they were created with a version of Acrobat newer than Acrobat 5.x.
What should I do?
output.WriteLine("Begin Parsing.....");
output.WriteLine(DateTime.Now.ToString());
PDDocument doc = PDDocument.load(path);
PDFTextStripper stripper = new PDFTextStripper();
output.Write(stripper.getText(doc));

Your first attempt should be to try a current version of PDFBox. Your version 0.7.3 dates back to 2006! PDFBox has meanwhile become an Apache project and is located here: http://pdfbox.apache.org/ and the current version (as of May 2013) is 1.8.1. And I'm very sure that PDFBox nowadays does support PDF object streams and cross reference streams, which were new in PDF Reference version 1.5, the version Adobe Acrobat 6 was built for.
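To get a first hint whether a given file is likely to use those PDF 1.5 features, you can look at the version in its "%PDF-x.y" header. The following is only a rough sketch in plain .NET (no PDFBox involved); the class and method names are made up for illustration. Note the assumption: the header is merely an indicator, as a document can also declare a newer version in its catalog, and a 1.5+ file need not actually use object streams.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFBox or iText.
public static class PdfHeaderCheck
{
    // Returns the version from the "%PDF-x.y" header, or null if no such
    // header is found within the first 1024 bytes of the stream.
    public static string ReadHeaderVersion(Stream pdf)
    {
        byte[] buffer = new byte[1024];
        int read = pdf.Read(buffer, 0, buffer.Length);
        string head = Encoding.ASCII.GetString(buffer, 0, read);
        Match m = Regex.Match(head, @"%PDF-(\d+\.\d+)");
        return m.Success ? m.Groups[1].Value : null;
    }

    // True if the header version is 1.5 or later, i.e. the file MAY contain
    // object streams or cross reference streams.
    public static bool MayUseObjectStreams(string version)
    {
        Version v;
        return version != null
            && Version.TryParse(version, out v)
            && v >= new Version(1, 5);
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: PdfHeaderCheck <file.pdf>");
            return;
        }
        using (FileStream fs = File.OpenRead(args[0]))
        {
            string version = ReadHeaderVersion(fs);
            Console.WriteLine("Header version: " + (version ?? "not found"));
            Console.WriteLine("May use PDF 1.5 features: " + MayUseObjectStreams(version));
        }
    }
}
```

If the header says 1.4 or lower and the old PDFBox still fails, the cause is probably something else.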
If that does not work, you might want to try other PDF libraries, e.g. iText (or iTextSharp in your case) version 5.4.x if the AGPL (or alternatively buying a license) is no problem for you.
Information on text parsing using iText(Sharp) can be found in chapter 15, "Marked content and parsing PDF", of iText in Action, 2nd Edition. The samples from that chapter can be found online: Java and .Net.
For a first test the sample ExtractPageContentSorted2.cs / ExtractPageContentSorted2.java would be a good start. The central code:
PdfReader reader = new PdfReader(PDF_FILE);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
StringBuilder sb = new StringBuilder();
for (int i = 1; i <= reader.NumberOfPages; i++) {
    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i));
}
If neither a current PDFBox version nor a current iText(Sharp) version can parse your PDF, you might want to post a sample for inspection; there are ways to drop all information required for text parsing from a PDF...

Related

itext PDF link doesn't work in Microsoft Edge

I'm adding a link to a file from a PDF document (created with iText) this way:
Chunk chunk = new Chunk(fileName, font);
chunk.SetAnchor("./relative/path/to/file");
Link works great if I open document in Google Chrome or Adobe reader.
But it doesn't work if I open my PDF in Microsoft Edge.
Is it even possible to create a file link inside pdf with itext that will work in Microsoft Edge? If yes, then how?

Having done some tests it appears that Edge does not support relative links in PDF documents.
It does support absolute links, though, given the full URI, e.g.
chunk = new Chunk("Only ASCII chars in target. Full path.");
chunk.SetAnchor("file:///C:/Repo/GitHub/testarea/itext5/target/test-outputs/annotate/Attachments/1.png");
doc.Add(new Paragraph(chunk));
In contrast to other PDF viewers (Adobe Reader, Chrome, cf. your previous question in this context) it does not support URL encoding of special characters like Cyrillic ones:
chunk = new Chunk("Cyrillic chars in target. URL-encoded. Full path. NOT WORKING");
chunk.SetAnchor("file:///C:/Repo/GitHub/testarea/itext5/target/test-outputs/annotate/" + WebUtility.UrlEncode("Вложения") + "/1.png");
doc.Add(new Paragraph(chunk));
But it does support the special characters in UTF-8 encoding. As UTF-8 PdfString encoding is a PDF-2.0 feature and iText 5 does not support PDF-2.0, one has to cheat a bit to inject strings in UTF-8 encoding here:
chunk = new Chunk("Cyrillic chars in target. Action manipulated. Full path.");
chunk.SetAnchor("XXX");
action = (PdfAction)chunk.Attributes[Chunk.ACTION];
action.Put(PdfName.URI, new PdfString(new UTF8Encoding().GetBytes("file:///C:/Repo/GitHub/testarea/itext5/target/test-outputs/annotate/Вложения/1.png")));
doc.Add(new Paragraph(chunk));
Tested with Edge 41.16299.666.0
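To make the difference between the two encodings concrete, here is a small sketch in plain .NET (no iText involved; the class and method names are made up for illustration). Both variants start from the same UTF-8 bytes: URL encoding writes each byte as an ASCII "%XX" triple, while the manipulated action above stores the raw bytes in the PdfString.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper for illustration only.
public static class LinkTargetEncoding
{
    // The variant Edge did not follow in the test above: percent-encoded ASCII.
    public static string UrlEncoded(string pathPart)
    {
        return WebUtility.UrlEncode(pathPart);
    }

    // The variant Edge did follow: the raw UTF-8 bytes of the path.
    public static byte[] Utf8Bytes(string pathPart)
    {
        return new UTF8Encoding().GetBytes(pathPart);
    }

    public static void Main()
    {
        string folder = "Вложения";
        Console.WriteLine(UrlEncoded(folder));
        // Each of the 8 Cyrillic letters takes two bytes in UTF-8.
        Console.WriteLine(BitConverter.ToString(Utf8Bytes(folder)));
    }
}
```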

Using C# iText 7 to flatten an XFA PDF

Is it possible to use iText 7 to flatten an XFA PDF? I'm only seeing Java documentation about it (http://developers.itextpdf.com/content/itext-7-examples/itext-7-form-examples/flatten-xfa-using-pdfxfa).
It seems like you can use iTextSharp to do this, however.
I believe it's not an AcroForm PDF because doing something similar to this answer How to flatten pdf with Itext in c#? simply created a PDF that wouldn't open properly.

It looks like you have to use iTextSharp and not iText7. Judging by the NuGet version, iTextSharp is essentially the .NET version of iText5 and, like Bruno mentioned in the comments above, the XFA stuff simply hasn't been ported to iText7 for .NET.
The confusion stemmed from having both iText7 and iTextSharp versions in NuGet, and the trial page didn't state that the XFA worker wasn't available for the .NET version of iText7 (yet?).
I did the following to accomplish what I needed, at least for a trial:
1. Request a trial copy here: http://demo.itextsupport.com/newslicense/
2. You'll be emailed an XML license key; you can just place it on your desktop for now.
3. Create a new console application in Visual Studio.
4. Open the Package Manager Console, type in the following, and press ENTER (this will install other dependencies as well):
Install-Package itextsharp.xfaworker
5. Use the following code:
static void Main(string[] args)
{
    ValidateLicense();
    var sourcePdfPath = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.DesktopDirectory), "<your_xfa_pdf_file>");
    var destinationPdfPath = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.DesktopDirectory), "output.pdf");
    FlattenPDF(sourcePdfPath, destinationPdfPath);
}

private static void ValidateLicense()
{
    var licenseFileLocation = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.DesktopDirectory), "itextkey.xml");
    iTextSharp.license.LicenseKey.LoadLicenseFile(licenseFileLocation);
}

private static void FlattenPDF(string sourcePdfPath, string destinationPdfPath)
{
    using (var sourcePdfStream = File.OpenRead(sourcePdfPath))
    {
        var document = new iTextSharp.text.Document();
        var writer = iTextSharp.text.pdf.PdfWriter.GetInstance(document, new FileStream(destinationPdfPath, FileMode.Create));
        var xfaf = new iTextSharp.tool.xml.xtra.xfa.XFAFlattener(document, writer);
        sourcePdfStream.Position = 0;
        xfaf.Flatten(new iTextSharp.text.pdf.PdfReader(sourcePdfStream));
        document.Close();
    }
}
The trial will put a huge watermark on the resulting PDF, but at least you can get it working and see how the full license should work.

For iText 7, this could be done in the following way:
LicenseKey.LoadLicenseFile(@"Path of the license file");
MemoryStream dest_File = new MemoryStream();
XFAFlattener xfaFlattener = new XFAFlattener();
xfaFlattener.Flatten(new MemoryStream(File.ReadAllBytes(@"C:\Unflattened file")), dest_File);
File.WriteAllBytes("flatten.pdf", dest_File.ToArray());

Select.Pdf load existing .pdf

The documentation for Select.Pdf for .Net v 2.22 uses the following code snippet to load an existing document. However, in the actual library there is no constructor that takes a string as a parameter. Is anyone aware of how to load an existing .pdf? I am using the Community Edition of this product.
string file = Server.MapPath("~/files/doc1.pdf");
// load the pdf document
PdfDocument doc = new PdfDocument(file);
// add a new page to the document
PdfPage page = doc.AddPage();
// create a new pdf font (component standard font)
PdfFont font = doc.AddFont(PdfStandardFont.Helvetica);
font.Size = 20;
// create text element and add it to the new page
PdfTextElement text = new PdfTextElement(100, 100,
"Sample text added to an existing pdf document.", font);
page.Add(text);
// save pdf document
doc.Save(Response, false, "Sample.pdf");
// close pdf document
doc.Close();

As suggested by Jason, here are my comments as an answer:
I've downloaded the latest version (16.2) from the Select.Pdf website and the .net 4 DLLs do contain the PdfDocument(string filename) constructor, just as described in their documentation.
I've also downloaded the Community Edition and checked the same PdfDocument class. In this edition, the constructor has fewer overloads.
So I guess that the Community Edition is a much older version and you cannot apply the full version examples to the Community Edition.
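If you want to check which constructor overloads the DLL you actually reference offers, reflection is a quick way to do it. This is only a sketch: StreamReader is used as a stand-in so that the code runs without Select.Pdf installed, and the helper names are made up. In your project you would pass the library's PdfDocument type instead (assuming it lives in the SelectPdf namespace).

```csharp
using System;
using System.IO;
using System.Linq;
using System.Reflection;

// Hypothetical helper for illustration only.
public static class ConstructorProbe
{
    // True if the type has a public constructor taking exactly one string.
    public static bool HasStringConstructor(Type type)
    {
        return type.GetConstructor(new[] { typeof(string) }) != null;
    }

    // Lists all public constructors of the type with their parameters.
    public static void PrintConstructors(Type type)
    {
        foreach (ConstructorInfo ctor in type.GetConstructors())
        {
            string parameters = string.Join(", ",
                ctor.GetParameters().Select(p => p.ParameterType.Name + " " + p.Name));
            Console.WriteLine(type.Name + "(" + parameters + ")");
        }
    }

    public static void Main()
    {
        // Stand-in type; replace with typeof(SelectPdf.PdfDocument) (assumed name).
        Type type = typeof(StreamReader);
        PrintConstructors(type);
        Console.WriteLine("Has (string) constructor: " + HasStringConstructor(type));
    }
}
```

Running this against both editions should show directly which overloads are missing in the Community Edition.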

Convert PDF to PDF/A3 or PDF/A-1 to PDF/A-3

I'm testing iTextSharp to generate ZUGFeRD files. My first step was to generate a ZUGFeRD-conformant file from an existing PDF/A-3 file. This was successful by using PdfACopy and creating the necessary PdfFileSpecification.
The next step would be to generate a PDF/A-3 file from an existing PDF or PDF/A-1 file, and this is the hard part.
First, when I try to use PdfACopy in combination with a regular PDF (not PDF/A), I get an error that PdfACopy can only be used with PDF/A-conformant files. My first question is: how do I get a PDF/A-3-conformant file from a PDF with iTextSharp?
To reduce the gap, I decided to convert the PDF into a PDF/A-1 file with Ghostscript (cf. How to use ghostscript to convert PDF to PDF/A or PDF/X?).
This was successful and I tried again. Then the error "Different PDF/A version." was thrown. It seems that I can't copy from an existing PDF/A-1 into a new PDF/A-3. How can I create this PDF/A-3 from an existing PDF(/A-1)? Is that even possible?
Here is my code:
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(XML);
byte[] xmlBytes = Encoding.Default.GetBytes(xmlDoc.OuterXml);
Document doc = new Document();
PdfReader src_reader = new PdfReader(pdfPath);
FileStream fs = new FileStream(DEST, FileMode.Create, FileAccess.ReadWrite);
PdfACopy aCopy = new PdfACopy(doc, fs, PdfAConformanceLevel.ZUGFeRD);
doc.AddLanguage("de-DE");
doc.AddTitle("title");
doc.SetPageSize(src_reader.GetPageSizeWithRotation(1));
aCopy.SetTagged();
aCopy.UserProperties = true;
aCopy.PdfVersion = PdfCopy.VERSION_1_7;
aCopy.ViewerPreferences = PdfCopy.DisplayDocTitle;
aCopy.CreateXmpMetadata();
aCopy.XmpWriter.SetProperty(PdfAXmpWriter.zugferdSchemaNS, PdfAXmpWriter.zugferdDocumentFileName, "ZUGFeRD-invoice.xml");
// From here on, no more metadata can be written
doc.Open();
ICC_Profile icc = ICC_Profile.GetInstance(new FileStream(ICM, FileMode.Open));
aCopy.SetOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
[...add the dictionary to doc..]
aCopy.AddDocument(src_reader);
doc.Close();
One more question: addDocument works, but when I'm using copy.addPage(copy.getImportedPage(src_reader, i)), an error "the document has no pages" will be thrown. WHY?

1. Can you convert a regular PDF to a PDF/A document?
The answer is: it depends.
PDF/A is a subset of PDF and involves some obligations (e.g. all fonts must be embedded) and restrictions (e.g. no JavaScript is allowed). iText can't "automatically" convert a regular PDF to a PDF/A for a number of reasons. For instance: if a font is not embedded, iText doesn't know which font to use to replace the unembedded font, nor where to find the necessary font program. Usually this requires human interaction because replacing one font by an arbitrary other font usually results in very ugly PDFs.
The answer is: it depends because some people are using iText to convert PDF to PDF/A, but this involves a lot of programming and human decisions. I see that you succeed when using GhostScript. In that case, GhostScript is making some decisions in your place. This can lead to acceptable results. In some cases, the result will not be acceptable (e.g. very odd-looking PDFs if the fonts don't match).
2. Can you convert a PDF/A-1 file to a PDF/A-3 file?
The PDF/A standard is written in such a way that old versions of the PDF/A specification are never outdated. Newer versions only add newer functionality. For instance: PDF/A-1 was based on the PDF 1.4 specification. Optional Content functionality (OCG) was introduced in PDF 1.5. The introduction of OCG is one of the differences between PDF/A-2 and PDF/A-1.
This means that every file that conforms to PDF/A-1 automatically conforms to PDF/A-2. However, a PDF/A-2 file could contain functionality that isn't supported in PDF/A-1.
3. What is the difference between PDF/A-2 and PDF/A-3?
PDF/A-2 and PDF/A-3 are identical, except for one difference: a PDF/A-3 file can have attachments that aren't PDF/A files. For instance: a PDF/A-3 file can have a Word file as attachment, an XLS file, a plain text file,... You mention ZUGFeRD: in that case, the PDF/A-3 file has at least an XML file as attachment.
Summarized:
This is a broad answer to a broad question (your question goes in many different directions, so it's hard to give you a specific answer). Why don't you use the already built-in ZUGFeRD support to create the invoices? Read ZUGFeRD, the future of invoicing for more info.

how to convert pdf file to text file using c#.net

Currently I am using the following code, together with some DLL files from PDFBox:
FileInfo file = new FileInfo("c://aa.pdf");
PDDocument doc = PDDocument.load(file.FullName);
PDFTextStripper pdfStripper = new PDFTextStripper();
string text = pdfStripper.getText(doc);
richTextBox1.Text = text;
With this code I am able to get the text, but not in the correct format. Please give me some ideas.

Extracting the text from a pdf file is anything but trivial.
To quote from the iTextSharp tutorial:
"The pdf format is just a canvas where text and graphics are placed without any structure information. As such there aren't any 'iText-objects' in a PDF file. In each page there will probably be a number of 'Strings', but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines. In short: parsing the content of a PDF-file is NOT POSSIBLE with iText."
There are several commercial applications which claim to be able to do it. Caveat Emptor.
There is also a free software library called Poppler (http://poppler.freedesktop.org/), which is used by the PDF viewers of GNOME and KDE. It ships with a command-line utility called pdftotext, but I have no experience with it. It may be your best free option.
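Since pdftotext is a command-line tool, one way to use it from C# is to start it as an external process. The following is only a sketch under some assumptions: pdftotext must be installed and on the PATH, the helper names are made up, and the -layout option (which asks pdftotext to keep the original physical layout) may or may not give you the format you are after.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for illustration only.
public static class PdfToTextRunner
{
    // Builds the process settings for: pdftotext -layout "<pdf>" "<txt>"
    public static ProcessStartInfo BuildStartInfo(string pdfPath, string txtPath)
    {
        return new ProcessStartInfo
        {
            FileName = "pdftotext", // assumed to be on the PATH
            Arguments = "-layout \"" + pdfPath + "\" \"" + txtPath + "\"",
            UseShellExecute = false,
            RedirectStandardError = true,
            CreateNoWindow = true
        };
    }

    // Runs pdftotext and returns its exit code (0 means success).
    public static int Run(string pdfPath, string txtPath)
    {
        using (Process process = Process.Start(BuildStartInfo(pdfPath, txtPath)))
        {
            string errors = process.StandardError.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                Console.Error.WriteLine(errors);
            }
            return process.ExitCode;
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("usage: PdfToTextRunner <input.pdf> <output.txt>");
            return;
        }
        Console.WriteLine("pdftotext exit code: " + Run(args[0], args[1]));
    }
}
```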

There is a blog article explaining the issues with PDF text extraction in general at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text