itext PDF link doesn't work in Microsoft Edge - c#

I'm adding a link to a file from a pdf document (created with itext)
this way:
Chunk chunk = new Chunk(fileName, font);
chunk.SetAnchor("./relative/path/to/file");
Link works great if I open document in Google Chrome or Adobe reader.
But it doesn't work if I open my PDF in Microsoft Edge.
Is it even possible to create a file link inside pdf with itext that will work in Microsoft Edge? If yes, then how?

Is it even possible to create a file link inside pdf with itext that will work in Microsoft Edge?
If yes, then how?
Having done some tests it appears that Edge does not support relative links in PDF documents.
It does support absolute links, though, given the full URI, e.g.
chunk = new Chunk("Only ASCII chars in target. Full path.");
chunk.SetAnchor("file:///C:/Repo/GitHub/testarea/itext5/target/test-outputs/annotate/Attachments/1.png");
doc.Add(new Paragraph(chunk));
In contrast to other PDF viewers (Adobe Reader, Chrome, cf. your previous question in this context) it does not support URL encoding of special characters like Cyrillic ones:
chunk = new Chunk("Cyrillic chars in target. URL-encoded. Full path. NOT WORKING");
chunk.SetAnchor("file:///C:/Repo/GitHub/testarea/itext5/target/test-outputs/annotate/" + WebUtility.UrlEncode("Вложения") + "/1.png");
doc.Add(new Paragraph(chunk));
But it does support the special characters in UTF-8 encoding. As UTF-8 PdfString encoding is a PDF-2.0 feature and iText 5 does not support PDF-2.0, one has to cheat a bit to inject strings in UTF-8 encoding here:
chunk = new Chunk("Cyrillic chars in target. Action manipulated. Full path.");
chunk.SetAnchor("XXX");
action = (PdfAction)chunk.Attributes[Chunk.ACTION];
action.Put(PdfName.URI, new PdfString(new UTF8Encoding().GetBytes("file:///C:/Repo/GitHub/testarea/itext5/target/test-outputs/annotate/Вложения/1.png")));
doc.Add(new Paragraph(chunk));
Tested with Edge 41.16299.666.0

Related

How to read pdf file to a text file in a proper format using Spire.PDF or any other library?

How can I read pdf files and save contents to a text file using Spire.PDF?
For example: Here is a pdf file and here is the desired text file from that pdf
I tried the below code to read the file and save it to a text file
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(#"C:\Users\Tamal\Desktop\101395a.pdf");
StringBuilder buffer = new StringBuilder();
foreach (PdfPageBase page in doc.Pages)
{
buffer.Append(page.ExtractText());
}
doc.Close();
String fileName = #"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);
But the output text file is not properly formatted. It has unnecessary whitespaces and a complete para is broken into multiple lines etc.
How do I get the desired result as in the desired text file?
Additionally, it is possible to detect and mark(like add a tag) to texts with bold, italic or underline forms as well? Also things get more problematic for pages have multiple columns of text.
Using iText
File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1));
System.out.println(stes.getResultantText());
This is (as the code says) a basic/simple text extraction strategy.
More advanced examples can be found in the documentation.
Use IronOCR
var Ocr = new IronOcr.AutoOcr();
var Results = Ocr.ReadPdf("E:\Demo.pdf");
File.WriteAllText("E:\Demo.txt", Convert.ToString(Results));
For reference https://ironsoftware.com/csharp/ocr/
Using this you should get formatted text output, but not exact desire output which you want.
If you want exact pre-interpreted output, then you should check paid OCR services like OmniPage capture SDK & Abbyy finereader SDK
That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PFF; I work with Java and the PDFBox library, but any attempt to extract text from PDF is heuristic and hence imperfect. This is a problem that has received considerable attention and some applications have better results than others, so you may want to survey all available options. Still, I think you'll have to clean up the result.

iText 7.0.4.0 - PdfWriter produces corrupted PDF for certain PDF file inputs

I'm having an issue where PdfWriter from iText 7.0.4.0 (.NET 4.5.1) produces corrupted PDF documents for certain input PDF files.
To elaborate, PDF files with well-formed paragraphs have no issues. However, if the input PDF contains irregular contents (for a lack of better words; please refer to the samples in Google drive), PdfWriter produces corrupted PDF files; by corrupted, I mean that the file can be opened, but it shows a blank page with extremely high zoom (in Adobe Reader XI). Corrupted samples have also been provided in the aforementioned Google drive link.
Sample code:
using (var pdfReader = new PdfReader("sample1_input.pdf"))
{
PdfDocument pdfDoc = new PdfDocument(pdfReader, new PdfWriter("sample1_corrupted_output.pdf"));
// Trying to highlight a part of PDF by referencing this example:
// https://developers.itextpdf.com/examples/stamping-content-existing-pdfs/clone-highlighting-text
// Commented out for now because PdfWriter is producing corrupted PDF documents for the samples and similar PDF files.
//PdfCanvas canvas = new PdfCanvas(pdfDoc.GetFirstPage());
//canvas.SetExtGState(new PdfExtGState().SetFillOpacity(0.1f));
//canvas.SaveState();
//canvas.SetFillColor(Color.YELLOW);
//canvas.Rectangle(100, 100, 200, 200);
//canvas.Fill();
//canvas.RestoreState();
pdfDoc.Close(); // Corrupted PDF file is produced, even without highlighting.
}
One "interesting" thing I noticed is that if I provide "new StampingProperties().UseAppendMode()" as the third parameter of PdfDocument (without the highlighting code), PdfWriter spits out the original file (although a few kb larger than the original for some reason). However, PdfWriter goes back to producing corrupted PDFs when the highlighting code is un-commented.
Link to sample files:
https://drive.google.com/open?id=0B3NPOZswWocQV09KMW5fbFVyUm8
sample1_input.pdf (input sample #1) -> sample1_corrupted_output.pdf (corrupted output)
sample2_input.pdf (input sample #2) -> sample2_corrupted_output.pdf (corrupted output)
Please kindly give some advice.
The cause of this corruption is an unusual structure of the page tree of the PDFs in question:
It is unusual in two ways:
It has a subtree without any page objects (dictionaries 17 an 21).
It has a node with mixed child node types (dictionary 10 has a Page child 3 and a Pages child 17)
If one removes the page-less subtree (by removing object 17 from the Kids of object 10), both quirks are removed and the code does not fail anymore.
While both quirks are weird, I don't see anything in ISO 32000-1 (unfortunately I don't have a copy of ISO 32000-2 yet) indicating that these unusual structures are explicitly forbidden. Thus, I would assume this is an iText bug.
I could reproduce the problem with iText 7.0.4 for Java but not the current development SNAPSHOT of 7.0.5.
Indeed, there is a commit dated 2017-09-19 10:03:37 [c0b35f0] described as "Fix bugs in pages tree rebuilding" with differences in the PdfPagesTree class in a code block described as "handle mix of PdfPage and PdfPages". Thus, the issue appears to be known and already fixed.
You may either wait for the 7.0.5 release or look for hotfixes 7.0.4.x.

Convert PDF to PDF/A3 or PDF/A-1 to PDF/A-3

I'm testing iTextSharp to generate ZUGFeRD-Files. My first step was to generate a ZUGFeRD conform file from an existing PDF/A-3 file. This was successfull by using PDFACopy and creating the necessary PDFFileSpecification.
The next step would be to generate a PDF/A-3 file from an existing PDF or PDF/A-1 file and this is the hard part.
First, when I'm trying to use PDFACopy in combination with a regular PDF (not PDF/A) im getting an error that PDFACopy can only be used with PDF/A-conform files. My first question is, how to get an PDF/A-3-conform file from a PDF with iTextSharp?
To reduce the gap, I decided to convert the PDF into PDF/A-1 file with ghostscript (cf. How to use ghostscript to convert PDF to PDF/A or PDF/X?).
This was succesfull and I tried again. Then the error "Different PDF/A version." was thrown. It seems that I can't copy from existing PDF/A-1 into a new PDF/A-3. How can I create this PDF/A-3 from an existing PDF(/A-1)? Is that even possible?
Here is my code:
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(XML);
byte[] xmlBytes = Encoding.Default.GetBytes(xmlDoc.OuterXml);
Document doc = new Document();
PdfReader src_reader = new PdfReader(pdfPath);
FileStream fs = new FileStream(DEST, FileMode.Create, FileAccess.ReadWrite);
PdfACopy aCopy = new PdfACopy(doc, fs, PdfAConformanceLevel.ZUGFeRD);
doc.AddLanguage("de-DE");
doc.AddTitle("title");
doc.SetPageSize(src_reader.GetPageSizeWithRotation(1));
aCopy.SetTagged();
aCopy.UserProperties = true;
aCopy.PdfVersion = PdfCopy.VERSION_1_7;
aCopy.ViewerPreferences = PdfCopy.DisplayDocTitle;
aCopy.CreateXmpMetadata();
aCopy.XmpWriter.SetProperty(PdfAXmpWriter.zugferdSchemaNS, PdfAXmpWriter.zugferdDocumentFileName, "ZUGFeRD-invoice.xml");
//Ab hier können keine Metadaten mehr geschrieben werden
doc.Open();
ICC_Profile icc = ICC_Profile.GetInstance(new FileStream(ICM, FileMode.Open));
aCopy.SetOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
[...add the dictionary to doc..]
aCopy.AddDocument(src_reader);
doc.Close();
One more question: addDocument works, but when I'm using copy.addPage(copy.getImportedPage(src_reader, i)), an error "the document has no pages" will be thrown. WHY?
1. Can you convert a regular PDF to a PDF/A document?
The answer is: it depends.
PDF/A is a subset of PDF and involves some obligations (e.g. all fonts must be embedded) and restrictions (e.g. no Javascript is allowed). iText can't "automatically" convert a regular PDF to a PDF/A for a number of reasons. For instance: if a font is not embedded, iText doesn't know which font to use to replace the unembedded font, nor where to find the necessary font program. Usually this requires human interaction because replacing one font by an arbitrary other font usually results in very ugly PDFs.
The answer is: it depends because some people are using iText to convert PDF to PDF/A, but this involves a lot of programming and human decisions. I see that you succeed when using GhostScript. In that case, GhostScript is making some decisions in your place. This can lead to acceptable results. In some cases, the result will not be acceptable (e.g. very odd-looking PDFs if the fonts don't match).
2. Can you convert a PDF/A-1 file to a PDF/A-3 file?
The PDF/A standard is written in such a way that old versions of the PDF/A specification are never outdated. Newer versions only add newer functionality. For instance: PDF/A-1 was based on the PDF 1.4 specification. Optional Content functionality (OCG) was introduced in PDF 1.5. The introduction of OCG is one of the differences between PDF/A-2 and PDF/A-1.
This means that every file that conforms to PDF/A-1 automatically conforms to PDF/A-2. However, a PDF/A-2 file could contain functionality that isn't supported in PDF/A-1.
3. What is the difference between PDF/A-2 and PDF/A-3?
PDF/A-2 and PDF/A-3 are identical, except for one difference: a PDF/A-3 file can have attachments that aren't PDF/A files. For instance: a PDF/A-3 file can have a Word file as attachment, an XLS file, a plain text file,... You mention ZUGFeRD: in that case, the PDF/A-3 file has at least an XML file as attachment.
Summarized:
This is a broad answer to a broad question (your question goes in many different directions, so it's hard to give you a specific answer). Why don't you use the already built-in ZUGFeRD support to create the invoices? Read ZUGFeRD, the future of invoicing for more info.

PDFBox 0.7.3 convert pdf to text

I want to convert pdf file to text file but some of pdf files do not work with pdfbox dll as the version of acrobat in newer than Acrobat 5.x
Please tell me what i do?
output.WriteLine("Begin Parsing.....");
output.WriteLine(DateTime.Now.ToString());
PDDocument doc = PDDocument.load(path);
PDFTextStripper stripper = new PDFTextStripper();
output.Write(stripper.getText(doc));
Your first attempt should be to try with a current version of PDFBox. Your version 0.7.3 dates back to 2006! PDFBox meanwhile has become an Apache project and is located here: http://pdfbox.apache.org/ and the current version (as of May 2013) is 1.8.1. And I'm very sure that PDFBox nowerdays does support PDF object streams and cross reference streams which were new in PDF Reference version 1.5, the version Adobe Acrobat 6 has been built for
If that does not work, you might want to try other PDF libraries, e.g. iText (or iTextSharp in your case) version 5.4.x if the AGPL (or alternatively buying a license) is no problem for you.
Information on text parsing using iText(Sharp) can be found in chapter15 Marked content and parsing PDF of iText in Action — 2nd Edition. The samples from that chapter can be found online: Java and .Net.
For a first test the sample ExtractPageContentSorted2.cs / ExtractPageContentSorted2.java would be a good start. The central code:
PdfReader reader = new PdfReader(PDF_FILE);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
StringBuilder sb = new StringBuilder();
for (int i = 1; i <= reader.NumberOfPages; i++) {
sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i));
}
If neither a current PDFBox version nor a current iText(Sharp) version can parse your PDF, you might want to post a sample for inspection; there are ways to drop all information required for text parsing from a PDF...

how to convert pdf file to text file using c#.net

currently i have been using the following code and i am using some dll files from pdfbox
FileInfo file = new FileInfo("c://aa.pdf");
PDDocument doc = PDDocument.load(file.FullName);
PDFTextStripper pdfStripper = new PDFTextStripper();
string text = pdfStripper.getText (doc);
richTextBox1.Text = qq;
using this code i can able to get text file but not in a correct format plz give me a some ideas
Extracting the text from a pdf file is anything but trivial.
To quote from th iTextSharp tutorial.
"The pdf format is just a canvas where
text and graphics are placed without
any structure information. As such
there aren't any 'iText-objects' in a
PDF file. In each page there will
probably be a number of 'Strings', but
you can't reconstruct a phrase or a
paragraph using these strings. There
are probably a number of lines drawn,
but you can't retrieve a Table-object
based on these lines. In short:
parsing the content of a PDF-file is
NOT POSSIBLE with iText."
There are several commercial applications which claim to be able to do it. Caveat Emptor.
There is also a free software library called Poppler http://poppler.freedesktop.org/ which is used by the pdf viewers of GNOME and KDE. It has a function called pdftotext() but I have no experience with it. It may be your best free option.
There is a blog article explaining the issues with PDF text extraction in general at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text

Categories

Resources