iText Acrofields Are Empty, form.getAsArray(PdfName.FIELDS) is also null - c#

I am using iText7.NET. A third party has provided PDFs with fields. The fields are present, and Adobe Acrobat seems to have no issues opening and displaying the PDF, but in iText the fields collection is empty.
I've seen the answer at ItextSharp - Acrofields are empty and the related knowledge-base articles on iText's site, but the fix does not work in my case, as form.getAsArray(PdfName.FIELDS) returns null, so it cannot be added to.
I've also checked for XFA, and it does not seem to be present:
XfaForm xfa = form.GetXfaForm();
xfa.IsXfaPresent() // returns false
Is it possible to add PdfName.FIELDS to the document and then populate?
Thank You

So I think I have figured out what causes the issue, and I have a short-term fix for my particular case. In this document some annotations have subtype "Link", not "Widget", and the fix code I was using (based on the link above, which most likely came from https://kb.itextsupport.com/home/it7kb/faq/why-are-the-acrofields-in-my-document-empty) fails on them. My fix is to skip subtype Link, which I don't need, although perhaps a better solution exists that would not skip Links.
If I don't skip Links, when the saved PDF is loaded again it fails on
PdfAcroForm form = PdfAcroForm.GetAcroForm(pdfDoc, true);
In the lower level code in itext.forms, IterateFields() is called and within that it passes formField.GetParent() as a parameter to
PdfFormField.MakeFormField, GetParent() returns null for the Link fields so there is an exception.
Below is the RUPS hierarchy to the first subtype Link field that causes a problem
So the solution at the moment to fix my particular issue is to skip sub type links. The code is as follows
PdfReader reader = new PdfReader(pdf);
MemoryStream dest = new MemoryStream();
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader, writer);
PdfCatalog root = pdfDoc.GetCatalog();
PdfDictionary form = root.GetPdfObject().GetAsDictionary(PdfName.AcroForm);
PdfArray fields = form.GetAsArray(PdfName.Fields);
if (fields == null)
{
    form.Put(PdfName.Fields, new PdfArray());
    fields = form.GetAsArray(PdfName.Fields);
}
for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    PdfPage page = pdfDoc.GetPage(i);
    var annots = page.GetAnnotations();
    for (int j = 0; j < annots.Count(); j++)
    {
        PdfObject o = annots[j].GetPdfObject();
        PdfDictionary m = o as PdfDictionary;
        string subType = m?.GetAsName(PdfName.Subtype)?.GetValue() ?? "";
        if (subType != "Link")
        {
            fields.Add(o);
            fields.SetModified();
        }
    }
}
pdfDoc.Close();

Related

C# - Read the content of PDF(form based) in the form of text [duplicate]

Good Morning,
I don't know how to read the field names from the PDF below.
I tried all the AcroFields methods, but they all return 0 or null.
http://www.finanse.mf.gov.pl/documents/766655/1481810/PIT-8C(7)_v1-0E.pdf
my code:
try {
PdfReader.unethicalreading = true;
PdfReader reader = new PdfReader(new FileInputStream("/root/TestPit8/web/notmod.pdf"));
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("/root/TestPit8/web/testpdf.pdf"));
AcroFields form = stamper.getAcroFields();
form.setField("text_1", "666");
form.setField("text_2", "666");
form.setField("text_3", "666");
form.setFieldProperty("text_3", "clrfflags", TextField.PASSWORD, null);
form.setFieldProperty("text_3", "setflags", PdfAnnotation.FLAGS_PRINT, null);
form.setField("text_3", "12345678", "xxxxxxxx");
form.setFieldProperty("text_4", "textsize", new Float(12), null);
form.regenerateField("text_4");
stamper.close();
reader.close();
} catch( Exception ex) {
ex.printStackTrace();
}
Thanks for your help.
The form you share is a pure XFA form. XFA stands for the XML Forms Architecture.
Please read The Best iText Questions on StackOverflow and scroll to the section entitled "Interactive forms".
These are the first two questions of this section:
How to fill out a pdf file programmatically? (AcroForm technology)
How to fill out a pdf file programmatically? (Dynamic XFA)
You are filling out the form as if it were based on AcroForm technology. That isn't supposed to work, is it? Your form is an XFA form!
Filling out an XFA form is explained in my book, in the XfaMovies example:
public void manipulatePdf(String src, String xml, String dest)
throws IOException, DocumentException {
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader,
new FileOutputStream(dest));
AcroFields form = stamper.getAcroFields();
XfaForm xfa = form.getXfa();
xfa.fillXfaForm(new FileInputStream(xml));
stamper.close();
reader.close();
}
In this case, src is a path to the original form, xml is a path to the XML data, and dest is the path of the filled out form.
If you want to read the data, you need the XfaMovie example:
This reads the full form (all the XFA):
public void readXfa(String src, String dest)
throws IOException, ParserConfigurationException, SAXException,
TransformerFactoryConfigurationError, TransformerException {
FileOutputStream os = new FileOutputStream(dest);
PdfReader reader = new PdfReader(src);
XfaForm xfa = new XfaForm(reader);
Document doc = xfa.getDomDocument();
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
tf.setOutputProperty(OutputKeys.INDENT, "yes");
tf.transform(new DOMSource(doc), new StreamResult(os));
reader.close();
}
If you only want the data, you need to examine the datasets node:
public void readData(String src, String dest)
throws IOException, ParserConfigurationException, SAXException,
TransformerFactoryConfigurationError, TransformerException {
FileOutputStream os = new FileOutputStream(dest);
PdfReader reader = new PdfReader(src);
XfaForm xfa = new XfaForm(reader);
Node node = xfa.getDatasetsNode();
NodeList list = node.getChildNodes();
for (int i = 0; i < list.getLength(); i++) {
if("data".equals(list.item(i).getLocalName())) {
node = list.item(i);
break;
}
}
list = node.getChildNodes();
for (int i = 0; i < list.getLength(); i++) {
if("movies".equals(list.item(i).getLocalName())) {
node = list.item(i);
break;
}
}
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
tf.setOutputProperty(OutputKeys.INDENT, "yes");
tf.transform(new DOMSource(node), new StreamResult(os));
reader.close();
}
Note that I don't understand why you think there are fields such as text_1, text_2 in the form. XFA fields are easy to recognize because they contain plenty of [] characters.
Also: from the screenshot below (taken with iText RUPS), it is clear that there are no such fields in the form:
The tools are there on the iText web site. The documentation is there. Please use it!
Update:
So... instead of accepting my comprehensive answer, you decided to post a comment asking me to do your work in your place by asking where I can find example code? in spite of the fact that I provided links to XfaMovie and XfaMovies.
Well, here are two new examples for you:
ReadXFA takes xfa_form_poland.pdf and reads the data with data.xml as result.
FillXFA2 takes xfa_form_poland.pdf and fills it out with xfa_form_poland.xml resulting in xfa_form_poland_filled.pdf
Of course: I don't understand Polish, so I didn't always fill out the correct values, but now at least you have no longer a reason to ask where I can find example code?
Update 2:
In an extra comment, you claim that you can't find the NIP number (number 10 in the form) anywhere in the data structure.
This means either that you haven't examined data.xml, or that you don't understand XML.
Allow me to show the relevant part of the XML that contains the NIP number:
<Deklaracja xmlns="http://crd.gov.pl/wzor/2014/12/05/1880/" xmlns:etd="http://crd.gov.pl/xml/schematy/dziedzinowe/mf/2011/06/21/eD/DefinicjeTypy/">
....
<Podmiot2 rola="Podatnik">
<etd:OsobaFizyczna>
<etd:NIP>0123456789</etd:NIP>
<etd:ImiePierwsze>JUST TRY</etd:ImiePierwsze>
<etd:Nazwisko>DUDE</etd:Nazwisko>
<etd:DataUrodzenia>2015-02-19</etd:DataUrodzenia>
</etd:OsobaFizyczna>
</Podmiot2>
...
</Deklaracja>
In other words, the field name you're looking for is probably something like this: Deklaracja[0].Podmiot2[0].OsobaFizyczna[0].NIP[0] (whatever these words may mean, I only know one Polish word: Podpis).
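To make the guess above concrete, here is a minimal sketch that derives such bracketed paths from an XML fragment using only the JDK's DOM parser. It is not iText code: the class name XfaPathSketch, the helper names and the shortened namespace URIs are made up for illustration, and whether the real XFA field names follow exactly this pattern is an assumption carried over from the answer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class XfaPathSketch {
    // Counts preceding sibling elements that share this element's local name.
    static int indexAmongSiblings(Element e) {
        int index = 0;
        for (Node n = e.getPreviousSibling(); n != null; n = n.getPreviousSibling()) {
            if (n instanceof Element && e.getLocalName().equals(n.getLocalName())) {
                index++;
            }
        }
        return index;
    }

    // Walks the tree and records "path = text" for every element without child elements.
    static void collect(Element e, String prefix, List<String> out) {
        String path = prefix + e.getLocalName() + "[" + indexAmongSiblings(e) + "]";
        boolean hasChildElement = false;
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                hasChildElement = true;
                collect((Element) n, path + ".", out);
            }
        }
        if (!hasChildElement) {
            out.add(path + " = " + e.getTextContent().trim());
        }
    }

    static List<String> leafPaths(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so getLocalName() drops the "etd:" prefix
            Element root = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            List<String> out = new ArrayList<>();
            collect(root, "", out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Namespace URIs are placeholders, not the ones used by the real form.
        String xml = "<Deklaracja xmlns=\"urn:example:form\" xmlns:etd=\"urn:example:types\">"
                + "<Podmiot2 rola=\"Podatnik\"><etd:OsobaFizyczna>"
                + "<etd:NIP>0123456789</etd:NIP>"
                + "</etd:OsobaFizyczna></Podmiot2></Deklaracja>";
        for (String line : leafPaths(xml)) {
            System.out.println(line);
        }
    }
}
```

Running it against the data.xml produced by the ReadXFA example would list every leaf value with a candidate path, which makes it easier to spot where a value such as the NIP number lives.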

Remove highlighted area in pdf using iTextSharp

I highlighted the word in pdf using the code in the answer to the following question: Highlight words in a pdf using itextsharp, not displaying highlighted word in browser
Now I want to know how to remove those highlighted rectangles using iTextSharp.
private void RemovehighlightPDFAnnotation(string outputFile, string highLightFile, int pageno, string highLightedText)
{
PdfReader reader = new PdfReader(outputFile);
using (FileStream fs = new FileStream(highLightFile, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (PdfStamper stamper = new PdfStamper(reader, fs))
{
PdfDictionary pageDict = reader.GetPageN(pageno);
PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);
if (annots != null)
{
for (int i = 0; i < annots.Size; ++i)
{
PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
if (subType.Equals(PdfName.HIGHLIGHT))
{
PdfString str = annots.GetAsString(i);
if(str==highLightedText)
{
annots.Remove(i);
}
}
}
}
}
}
It removes all annotations, but I want to remove one particular annotation.
Suppose I highlighted "united states" and "Patent Application Publication" on page 1, and now I want to remove "united states" alone. I will pass the text "united states".
I referred to this answer. According to it, to get the highlighted text you need to get the coordinates stored in the Highlight annotation (in the QuadPoints array) and use these coordinates to parse the text that is present in the page content at those coordinates.
Getting the highlighted annotation coordinates
As the OP clarified, he actually wants to
get the highlighted annotation coordinates
to extract the text from that area, check whether it matches the phrase in question, and (if it does) remove the annotation.
As the code in question always only marks a single rectangle with each annotation and chose the rectangle to only contain the text in question, he can simply use the annotation rectangle
annotationDic.GetAsArray(PdfName.RECT)
In a more generic case (i.e. for highlight annotations starting on the end of one line and ending at the start of the next), he'd need to check the quad points
annotationDic.GetAsArray(PdfName.QUADPOINTS)
which describe a set of quadrilaterals.
E.g. in case of the sample from the referenced question (highlighting the occurrence of the word "support" on the third document page of the OP's sample PDF), the method
private void ReportHighlightPDFAnnotation(string highLightFile, int pageno)
{
PdfReader reader = new PdfReader(highLightFile);
PdfDictionary pageDict = reader.GetPageN(pageno);
PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);
if (annots != null)
{
for (int i = 0; i < annots.Size; ++i)
{
PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
if (subType.Equals(PdfName.HIGHLIGHT))
{
Console.Write("HighLight at {0} with {1}\n", annotationDic.GetAsArray(PdfName.RECT), annotationDic.GetAsArray(PdfName.QUADPOINTS));
}
}
}
}
reports
HighLight at [224.65, 654.03, 251.08, 662.03] with [221.65, 654.03, 251.08, 654.03, 221.65, 663.03, 251.08, 663.03]
HighLight at [80.9, 574.13, 107.28, 582.13] with [77.9, 574.13, 107.28, 574.13, 77.9, 583.13, 107.28, 583.13]
HighLight at [209.3, 544.33, 235.67, 552.33] with [206.3, 544.33, 235.67, 544.33, 206.3, 553.33, 235.67, 553.33]
In particular those values are not null as the OP claims in his comment
null value only i get for PdfArray annots = pageDict.GetAsArray(PdfName.QUADPOINTS) and annotationDic.GetAsArray(PdfName.RECT)
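For the more generic QuadPoints case mentioned above, the array is a flat list of numbers, eight per quadrilateral (four x/y pairs). A minimal sketch, in plain Java without iText, of turning such an array into one bounding box per quadrilateral could look like this; the class and method names are made up for illustration, and the input is assumed to have already been copied out of the PdfArray into a double[].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class QuadPointsSketch {
    // Returns one {minX, minY, maxX, maxY} box per complete group of 8 numbers.
    static List<double[]> boundingBoxes(double[] quadPoints) {
        List<double[]> boxes = new ArrayList<>();
        for (int i = 0; i + 7 < quadPoints.length; i += 8) {
            double minX = Double.POSITIVE_INFINITY;
            double minY = Double.POSITIVE_INFINITY;
            double maxX = Double.NEGATIVE_INFINITY;
            double maxY = Double.NEGATIVE_INFINITY;
            // Each quadrilateral is four (x, y) corner points.
            for (int j = 0; j < 8; j += 2) {
                minX = Math.min(minX, quadPoints[i + j]);
                maxX = Math.max(maxX, quadPoints[i + j]);
                minY = Math.min(minY, quadPoints[i + j + 1]);
                maxY = Math.max(maxY, quadPoints[i + j + 1]);
            }
            boxes.add(new double[] {minX, minY, maxX, maxY});
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Quad points of the first highlight reported above.
        double[] quad = {221.65, 654.03, 251.08, 654.03, 221.65, 663.03, 251.08, 663.03};
        for (double[] box : boundingBoxes(quad)) {
            System.out.println(Arrays.toString(box));
        }
    }
}
```

For the first reported highlight this yields the box 221.65, 654.03, 251.08, 663.03. Each box can then serve as the region for text extraction; a highlight that wraps across two lines simply produces two boxes.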
An alternative approach
If I were the OP, I'd add private data to the annotations I create which contain the highlighted phrase. When he wants to remove the annotations for a given phrase, he can simply check that private data.
Text extraction, even from a limited area, is a very costly operation as the page content stream and a possible multitude of form xobject streams have to be parsed.
A warning on loop design
The OP wants to remove the annotations in this loop:
for (int i = 0; i < annots.Size; ++i)
{
PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
if (subType.Equals(PdfName.HIGHLIGHT))
{
PdfString str = annots.GetAsString(i);
annots.Remove(i);
}
}
The problem: If he is at index i and removes this annotation, the former i+1st annotation becomes the ith one. As the next annotation to check, though, is the now i+1st, that former i+1st annotation will not be checked or removed.
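One common way around this is to walk the array from the end, so that a removal only shifts elements that have already been checked. The sketch below shows the idea in plain Java, with each annotation modelled as a Map instead of a PdfDictionary. It also assumes the "private data" approach suggested earlier: the key name HighlightedPhrase is hypothetical, not something iText or the PDF specification defines.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RemoveHighlightsSketch {
    // Hypothetical private key under which the highlighted phrase was stored at creation time.
    static final String PHRASE_KEY = "HighlightedPhrase";

    static Map<String, String> annotation(String subtype, String phrase) {
        Map<String, String> a = new HashMap<>();
        a.put("Subtype", subtype);
        if (phrase != null) {
            a.put(PHRASE_KEY, phrase);
        }
        return a;
    }

    // Removes every Highlight annotation whose stored phrase matches; returns how many were removed.
    static int removeHighlights(List<Map<String, String>> annots, String phrase) {
        int removed = 0;
        // Iterate backwards: removing index i never changes the indices still to be visited.
        for (int i = annots.size() - 1; i >= 0; i--) {
            Map<String, String> a = annots.get(i);
            if ("Highlight".equals(a.get("Subtype")) && phrase.equals(a.get(PHRASE_KEY))) {
                annots.remove(i);
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        List<Map<String, String>> annots = new ArrayList<>();
        annots.add(annotation("Highlight", "united states"));
        annots.add(annotation("Highlight", "united states")); // adjacent match, skipped by a forward loop
        annots.add(annotation("Highlight", "Patent Application Publication"));
        annots.add(annotation("Link", null));
        int removed = removeHighlights(annots, "united states");
        System.out.println(removed + " removed, " + annots.size() + " left");
    }
}
```

With a forward loop the second of the two adjacent matches would survive; an alternative that keeps the forward direction is to decrement i after each removal.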

How to add in reply to annotation using iTextSharp

I am trying to add a sticky note reply to a PDF using iTextSharp. I am able to create a new annotation in the PDF, but I cannot link it as a child of an already existing annotation. I copied most of the properties of the parent to its child; I worked out which properties a reply needs by manually adding a reply in Adobe Reader and analyzing it. What I am missing is the /IRT property. It needs a reference to the parent popup, like /IRT 16 0 R.
Below is the code i am trying.
private void annotateReplyPdf()
{
string outputFile = @"D:\temp\temp.pdf";
// Creating iTextSharp.text.pdf.PdfReader object to read the Existing PDF Document
using (PdfReader reader = new PdfReader(FILE_NAME))
{
using (FileStream fs = new FileStream(outputFile, FileMode.Create, FileAccess.Write, FileShare.None))
{
// Creating iTextSharp.text.pdf.PdfStamper object to write Data from iTextSharp.text.pdf.PdfReader object to FileStream object
using (PdfStamper stamper = new PdfStamper(reader, fs))
{
//get page 1
PdfDictionary pageDic = reader.GetPageN(1);
//get annotations in page 1
PdfArray pageAnnotsArray = pageDic.GetAsArray(PdfName.ANNOTS);
if (pageAnnotsArray != null)
{
PdfDictionary curAnnotDic = pageAnnotsArray.GetAsDict(0);
PdfArray rect = curAnnotDic.GetAsArray(PdfName.RECT);
Rectangle rectangle = new Rectangle(float.Parse(rect[0].ToString()), float.Parse(rect[1].ToString()), float.Parse(rect[2].ToString()), float.Parse(rect[3].ToString()));
PdfAnnotation newAnnot = new PdfAnnotation(stamper.Writer, rectangle);
newAnnot.Title = "john.conor";
var dtNow = DateTime.Now;
newAnnot.Put(PdfName.C, curAnnotDic.Get(PdfName.C));
newAnnot.Put(PdfName.CONTENTS, new PdfString("Reply using prog"));
newAnnot.Put(PdfName.CREATIONDATE, new PdfDate(dtNow));
// newAnnot.Put(PdfName.IRT, curAnnotDic.); stuck here
newAnnot.Put(PdfName.M, new PdfDate(dtNow));
newAnnot.Put(PdfName.NAME, curAnnotDic.Get(PdfName.NAME));
newAnnot.Put(PdfName.RC, curAnnotDic.Get(PdfName.RC));
newAnnot.Put(PdfName.SUBTYPE, PdfName.TEXT);
newAnnot.Put(PdfName.SUBJECT, curAnnotDic.Get(PdfName.SUBJECT));
stamper.AddAnnotation(newAnnot, 1);
}
}
}
}
}
The methods I have used might not be accurate or efficient, as most of the code was found by trial and error and by checking other similar examples (and the PDF specification).
Can somebody please fill in the code that does the magic?
note: SO question doesn't provide a code for the answer.
Please take a look at the AddInReplyTo example.
We have a file named hello_sticky_note.pdf that looks like this:
I am going to skip the method to detect the annotation of the sticky note (in your question, you already have this code). In my example, I know that this annotation is the first entry in the /Annots array (the annotation with index 0).
This is how I'm going to add an "in reply to" annotation:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
PdfReader reader = new PdfReader(src);
PdfDictionary page = reader.getPageN(1);
PdfArray annots = page.getAsArray(PdfName.ANNOTS);
PdfDictionary sticky = annots.getAsDict(0);
PdfArray stickyRect = sticky.getAsArray(PdfName.RECT);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
PdfWriter writer = stamper.getWriter();
Rectangle stickyRectangle = new Rectangle(
stickyRect.getAsNumber(0).floatValue(), stickyRect.getAsNumber(1).floatValue(),
stickyRect.getAsNumber(2).floatValue(), stickyRect.getAsNumber(3).floatValue()
);
PdfAnnotation replySticky = PdfAnnotation.createText(
writer, stickyRectangle, "Reply", "Hello PDF", true, "Comment");
replySticky.put(PdfName.IRT, annots.getAsIndirectObject(0));
stamper.addAnnotation(replySticky, 1);
stamper.close();
}
Just like you, I get the original annotation (in my code, it's named sticky) and I get the position of that annotation (stickyRect). I create a stickyRectangle object in a slightly different way than you do (my way is better, but that doesn't matter too much) and I use that stickyRectangle to create a new PdfAnnotation named replySticky.
That's what you already have. Now I add the missing part:
replySticky.Put(PdfName.IRT, annots.GetAsIndirectObject(0));
In your code, you add the annotation dictionary, but what you actually need is the reference to that dictionary.
The resulting PDF looks like hello_in_reply_to.pdf:

Edit DirectContent of iTextSharp PdfSmartCopy class

At work I sometimes have to merge anywhere from a few to a few hundred PDF files. Until now I've been using the Writer and ImportedPages classes. But once all the files are merged into one, the file size becomes enormous, roughly the sum of all merged file sizes, because the fonts are attached to every page and not reused (fonts are embedded per page, not once for the whole document).
Not long ago I found out about the PdfSmartCopy class, which reuses embedded fonts and images. And here the problem kicks in. Very often, before merging files together, I have to add additional content to them (images, text). For this purpose I usually use a PdfContentByte from the Writer object.
Document doc = new Document();
PdfWriter writer = PdfWriter.GetInstance(doc, new FileStream(@"C:\test.pdf", FileMode.Create));
PdfContentByte cb = writer.DirectContent;
cb.Rectangle(100, 100, 100, 100);
cb.SetColorStroke(BaseColor.RED);
cb.SetColorFill(BaseColor.RED);
cb.FillStroke();
When I do a similar thing with a PdfSmartCopy object, the pages are merged, but no additional content is added. Full code of my test with PdfSmartCopy:
using (Document doc = new Document())
{
using (PdfSmartCopy copy = new PdfSmartCopy(doc, new FileStream(Path.GetDirectoryName(pdfPath[0]) + "\\testas.pdf", FileMode.Create)))
{
doc.Open();
PdfContentByte cb = copy.DirectContent;
for (int i = 0; i < pdfPath.Length; i++)
{
PdfReader reader = new PdfReader(pdfPath[i]);
for (int ii = 0; ii < reader.NumberOfPages; ii++)
{
PdfImportedPage import = copy.GetImportedPage(reader, ii + 1);
copy.AddPage(import);
cb.Rectangle(100, 100, 100, 100);
cb.SetColorStroke(BaseColor.RED);
cb.SetColorFill(BaseColor.RED);
cb.FillStroke();
doc.NewPage(); // not a necessary line
//ColumnText col = new ColumnText(cb);
//col.SetSimpleColumn(100,100,500,500);
//col.AddText(new Chunk("wdasdasd", PdfFontManager.GetFont(@"C:\Windows\Fonts\arial.ttf", 20)));
//col.Go();
}
}
}
}
}
Now I have a few questions:
1. Is it possible to edit the PdfSmartCopy object's DirectContent?
2. If not, is there another way to merge multiple PDF files into one without increasing its size dramatically, while still being able to add additional content to pages while merging?
First this: using PdfWriter/PdfImportedPage is not a good idea. You throw away all interactive features! Being the author of iText, it's very frustrating to see so many people making the same mistake in spite of the fact that I wrote two books about this, and in spite of the fact that I convinced my publisher to offer one of the most important chapters for free: http://www.manning.com/lowagie2/samplechapter6.pdf
Is my writing really that bad? Or is there another reason why people keep on merging documents using PdfWriter/PdfImportedPage?
As for your specific questions, here are the answers:
1. Yes. Download the sample chapter and search the PDF file for PageStamp.
2. Only if you create the PDF in two passes. For instance: create the huge PDF first, then reduce the size by passing it through PdfCopy; or create the merged PDF first with PdfCopy, then add the extra content in a second pass using PdfStamper.
Code after using Bruno Lowagie's answer:
for (int i = 0; i < pdfPath.Length; i++)
{
PdfReader reader = new PdfReader(pdfPath[i]);
PdfImportedPage page;
PdfSmartCopy.PageStamp stamp;
for (int ii = 0; ii < reader.NumberOfPages; ii++)
{
page = copy.GetImportedPage(reader, ii + 1);
stamp = copy.CreatePageStamp(page);
PdfContentByte cb = stamp.GetOverContent();
cb.Rectangle(100, 100, 100, 100);
cb.SetColorStroke(BaseColor.RED);
cb.SetColorFill(BaseColor.RED);
cb.FillStroke();
stamp.AlterContents(); // don't forget to add this line
copy.AddPage(page);
}
}
2. Only if you create the PDF in two passes. For instance: create the huge PDF first, then reduce the size by passing it through PdfCopy; or create the merged PDF first with PdfCopy, then add the extra content in a second pass using PdfStamper.
It is much more difficult to use PdfStamper in a second pass. When you're working with lots of data, it's far easier to create one PDF, stamp it, and then append it.
PdfCopyFields had worked well for this. Now it doesn't work as of the 5.4.4.0 release which is why I'm here.

How do I extract attachments from a pdf file?

I have a large number of PDF documents with XML files attached to them. I would like to extract those attached XML files and read them. How can I do this programmatically using .NET?
iTextSharp is also quite capable of extracting attachments... Though you might have to use the low level objects to do so.
There are two ways to embed files in a PDF:
In a File Annotation
At the document level "EmbeddedFiles".
Once you have a file specification dictionary from either source, the file itself will be a stream within the dictionary labeled "EF" (embedded file).
So to list all the files at the document level, one would write code (in Java) as such:
Map<String, byte[]> files = new HashMap<String, byte[]>();
PdfReader reader = new PdfReader(pdfPath);
PdfDictionary root = reader.getCatalog();
PdfDictionary names = root.getAsDict(PdfName.NAMES); // may be null
PdfDictionary embeddedFilesDict = (names == null) ? null : names.getAsDict(PdfName.EMBEDDEDFILES); // may be null
// Only a flat /Names array at the root is handled here; a name tree may also be split across /Kids nodes.
PdfArray embeddedFiles = (embeddedFilesDict == null) ? null : embeddedFilesDict.getAsArray(PdfName.NAMES); // may be null
int len = (embeddedFiles == null) ? 0 : embeddedFiles.size();
for (int i = 0; i + 1 < len; i += 2) { // entries come in (name, file specification) pairs
    PdfString name = embeddedFiles.getAsString(i); // should always be present
    PdfDictionary fileSpec = embeddedFiles.getAsDict(i + 1); // ditto
    PdfDictionary streams = fileSpec.getAsDict(PdfName.EF);
    if (streams == null)
        continue; // file specification without an embedded file
    PRStream stream = null;
    if (streams.contains(PdfName.UF))
        stream = (PRStream) streams.getAsStream(PdfName.UF);
    else
        stream = (PRStream) streams.getAsStream(PdfName.F); // Default stream for backwards compatibility
    if (stream != null) {
        files.put(name.toUnicodeString(), PdfReader.getStreamBytes(stream));
    }
}
reader.close();
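One caveat about the listing above: it reads the Names array directly from the EmbeddedFiles dictionary. The PDF name tree structure also allows the entries to be spread over child nodes referenced from a Kids array, in which case the root has no Names array at all. The following is a minimal sketch of the recursive walk in plain Java, with dictionaries modelled as Map and arrays as List rather than iText objects; the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NameTreeSketch {
    // Collects (name, value) pairs from a node's flat Names array, then recurses into Kids.
    @SuppressWarnings("unchecked")
    static void flatten(Map<String, Object> node, Map<String, Object> out) {
        if (node == null) {
            return;
        }
        List<Object> names = (List<Object>) node.get("Names");
        if (names != null) {
            // Names is a flat array: key, value, key, value, ...
            for (int i = 0; i + 1 < names.size(); i += 2) {
                out.put((String) names.get(i), names.get(i + 1));
            }
        }
        List<Object> kids = (List<Object>) node.get("Kids");
        if (kids != null) {
            for (Object kid : kids) {
                flatten((Map<String, Object>) kid, out);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-ins for file specification dictionaries are plain strings here.
        Map<String, Object> leaf1 = Map.of("Names", List.of("a.xml", "specA", "b.xml", "specB"));
        Map<String, Object> leaf2 = Map.of("Names", List.of("c.xml", "specC"));
        Map<String, Object> root = Map.of("Kids", List.of(leaf1, leaf2));

        Map<String, Object> found = new LinkedHashMap<>();
        flatten(root, found);
        System.out.println(found.keySet());
    }
}
```

In real iText code the same shape would apply, with the Map lookups replaced by the dictionary and array accessors already used in the listing.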
This is an old question; nonetheless I think my alternative solution (using PDF Clown) may be of some interest, as it's much cleaner (and more complete, as it iterates both at document and page level) than the code fragments previously proposed:
using org.pdfclown.bytes;
using org.pdfclown.documents;
using org.pdfclown.documents.files;
using org.pdfclown.documents.interaction.annotations;
using org.pdfclown.objects;
using System;
using System.Collections.Generic;
void ExtractAttachments(string pdfPath)
{
Dictionary<string, byte[]> attachments = new Dictionary<string, byte[]>();
using(org.pdfclown.files.File file = new org.pdfclown.files.File(pdfPath))
{
Document document = file.Document;
// 1. Embedded files (document level).
foreach(KeyValuePair<PdfString,FileSpecification> entry in document.Names.EmbeddedFiles)
{EvaluateDataFile(attachments, entry.Value);}
// 2. File attachments (page level).
foreach(Page page in document.Pages)
{
foreach(Annotation annotation in page.Annotations)
{
if(annotation is FileAttachment)
{EvaluateDataFile(attachments, ((FileAttachment)annotation).DataFile);}
}
}
}
}
void EvaluateDataFile(Dictionary<string, byte[]> attachments, FileSpecification dataFile)
{
if(dataFile is FullFileSpecification)
{
EmbeddedFile embeddedFile = ((FullFileSpecification)dataFile).EmbeddedFile;
if(embeddedFile != null)
{attachments[dataFile.Path] = embeddedFile.Data.ToByteArray();}
}
}
Note that you don't have to bother with null pointer exceptions as PDF Clown provides all the necessary abstraction and automation to ensure smooth model traversal.
PDF Clown is an LGPL 3 library, implemented on both the Java and .NET platforms (I'm its lead developer): if you want to give it a try, I suggest you check out its SVN repository on sourceforge.net, as it keeps evolving.
Look for ABCpdf-Library, very easy and fast in my opinion.
What I got working is slightly different than anything else I have seen online.
So, just in case, I thought I would post this here to help someone else. I had to go through many different iterations to figure out - the hard way - what I needed to get it to work.
I am merging two PDFs into a third PDF, where one of the first two PDFs may have file attachments that need to be carried over into the third PDF. I am working completely in streams with ASP.NET, C# 4.0, ITextSharp 5.1.2.0.
// Extract Files from Submit PDF
Dictionary<string, byte[]> files = new Dictionary<string, byte[]>();
PdfDictionary names;
PdfDictionary embeddedFiles;
PdfArray fileSpecs;
int eFLength = 0;
names = writeReader.Catalog.GetAsDict(PdfName.NAMES); // may be null, writeReader is the PdfReader for a PDF input stream
if (names != null)
{
embeddedFiles = names.GetAsDict(PdfName.EMBEDDEDFILES); //may be null
if (embeddedFiles != null)
{
fileSpecs = embeddedFiles.GetAsArray(PdfName.NAMES); //may be null
if (fileSpecs != null)
{
eFLength = fileSpecs.Size;
for (int i = 0; i < eFLength; i++)
{
i++; //objects are in pairs and only want odd objects (1,3,5...)
PdfDictionary fileSpec = fileSpecs.GetAsDict(i); // may be null
if (fileSpec != null)
{
PdfDictionary refs = fileSpec.GetAsDict(PdfName.EF);
foreach (PdfName key in refs.Keys)
{
PRStream stream = (PRStream)PdfReader.GetPdfObject(refs.GetAsIndirectObject(key));
if (stream != null)
{
files.Add(fileSpec.GetAsString(key).ToString(), PdfReader.GetStreamBytes(stream));
}
}
}
}
}
}
}
You may try Aspose.Pdf.Kit for .NET. The PdfExtractor class allows you to extract attachments with the help of two methods: ExtractAttachment and GetAttachment. Please see an example of attachment extraction.
Disclosure: I work as developer evangelist at Aspose.
