I highlighted words in a PDF using the code from the answer to the following question: Highlight words in a pdf using itextsharp, not displaying highlighted word in browser
Now I want to know how to remove those highlighted rectangles using iTextSharp.
private void RemovehighlightPDFAnnotation(string outputFile, string highLightFile, int pageno, string highLightedText)
{
    PdfReader reader = new PdfReader(outputFile);
    using (FileStream fs = new FileStream(highLightFile, FileMode.Create, FileAccess.Write, FileShare.None))
    {
        using (PdfStamper stamper = new PdfStamper(reader, fs))
        {
            PdfDictionary pageDict = reader.GetPageN(pageno);
            PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);
            if (annots != null)
            {
                for (int i = 0; i < annots.Size; ++i)
                {
                    PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
                    PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
                    if (subType.Equals(PdfName.HIGHLIGHT))
                    {
                        PdfString str = annots.GetAsString(i);
                        if (str == highLightedText)
                        {
                            annots.Remove(i);
                        }
                    }
                }
            }
        }
    }
}
It removes all annotations, but I want to remove only a particular annotation.
Suppose I highlighted "united states" and "Patent Application Publication" on page 1; now I want to remove "united states" alone. I will pass the text "united states".
I referred to this answer. In that answer, to get the highlighted text, you need to get the coordinates stored in the Highlight annotation (in its QuadPoints array) and use those coordinates to parse the text that is present in the page content at those coordinates.
Getting the highlighted annotation coordinates
As the OP clarified, he actually wants to
get the highlighted annotation coordinates
to extract the text from that area, check whether it matches the phrase in question, and (if it does) remove the annotation.
As the code in question always marks only a single rectangle with each annotation and chooses that rectangle to contain only the text in question, he can simply use the annotation rectangle
annotationDic.GetAsArray(PdfName.RECT)
In a more generic case (i.e. for highlight annotations starting on the end of one line and ending at the start of the next), he'd need to check the quad points
annotationDic.GetAsArray(PdfName.QUADPOINTS)
which describe a set of quadrilaterals.
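If needed, each group of eight numbers in that array can be turned into a rectangle. A minimal sketch, assuming axis-aligned quads; the variable quadRectangles is an illustrative name, the other names follow the snippets above:
PdfArray quads = annotationDic.GetAsArray(PdfName.QUADPOINTS);
List<Rectangle> quadRectangles = new List<Rectangle>();
// Each quadrilateral is stored as eight numbers: x1 y1 x2 y2 x3 y3 x4 y4.
for (int q = 0; q + 7 < quads.Size; q += 8)
{
    float[] c = new float[8];
    for (int k = 0; k < 8; k++)
        c[k] = quads.GetAsNumber(q + k).FloatValue;
    // For axis-aligned highlights the bounding box of the four corners is enough.
    float llx = Math.Min(Math.Min(c[0], c[2]), Math.Min(c[4], c[6]));
    float lly = Math.Min(Math.Min(c[1], c[3]), Math.Min(c[5], c[7]));
    float urx = Math.Max(Math.Max(c[0], c[2]), Math.Max(c[4], c[6]));
    float ury = Math.Max(Math.Max(c[1], c[3]), Math.Max(c[5], c[7]));
    quadRectangles.Add(new Rectangle(llx, lly, urx, ury));
}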
E.g. in case of the sample from the referenced question (highlighting the occurrence of the word "support" on the third document page of the OP's sample PDF), the method
private void ReportHighlightPDFAnnotation(string highLightFile, int pageno)
{
    PdfReader reader = new PdfReader(highLightFile);
    PdfDictionary pageDict = reader.GetPageN(pageno);
    PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);
    if (annots != null)
    {
        for (int i = 0; i < annots.Size; ++i)
        {
            PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
            PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
            if (subType.Equals(PdfName.HIGHLIGHT))
            {
                Console.Write("HighLight at {0} with {1}\n", annotationDic.GetAsArray(PdfName.RECT), annotationDic.GetAsArray(PdfName.QUADPOINTS));
            }
        }
    }
}
reports
HighLight at [224.65, 654.03, 251.08, 662.03] with [221.65, 654.03, 251.08, 654.03, 221.65, 663.03, 251.08, 663.03]
HighLight at [80.9, 574.13, 107.28, 582.13] with [77.9, 574.13, 107.28, 574.13, 77.9, 583.13, 107.28, 583.13]
HighLight at [209.3, 544.33, 235.67, 552.33] with [206.3, 544.33, 235.67, 544.33, 206.3, 553.33, 235.67, 553.33]
In particular, those values are not null, contrary to what the OP claims in his comment:
null value only i get for PdfArray annots = pageDict.GetAsArray(PdfName.QUADPOINTS) and annotationDic.GetAsArray(PdfName.RECT)
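For what it's worth, checking the text under such a rectangle can be done with the iTextSharp parser classes. A rough sketch, not taken from the OP's project; variable names follow the snippets above:
// Extract the page text that lies inside the annotation rectangle and compare it
// with the phrase to remove; RegionTextRenderFilter restricts extraction to that area.
PdfArray rect = annotationDic.GetAsArray(PdfName.RECT);
Rectangle area = new Rectangle(
    rect.GetAsNumber(0).FloatValue, rect.GetAsNumber(1).FloatValue,
    rect.GetAsNumber(2).FloatValue, rect.GetAsNumber(3).FloatValue);
RenderFilter filter = new RegionTextRenderFilter(area);
ITextExtractionStrategy strategy =
    new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
string textInArea = PdfTextExtractor.GetTextFromPage(reader, pageno, strategy);
if (string.Equals(textInArea.Trim(), highLightedText, StringComparison.OrdinalIgnoreCase))
{
    annots.Remove(i);
    i--; // compensate for the shifted indices (see the loop warning below)
}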
An alternative approach
If I were the OP, I'd add private data containing the highlighted phrase to the annotations I create. When he wants to remove the annotations for a given phrase, he can simply check that private data.
Text extraction, even from a limited area, is a very costly operation as the page content stream and a possible multitude of form xobject streams have to be parsed.
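A sketch of that idea, using an arbitrary non-standard key; the name HighlightedPhrase is my own choice, not a standard PDF name, and highlight stands for the PdfAnnotation created by the highlighting code from the referenced question:
PdfName phraseKey = new PdfName("HighlightedPhrase");

// When creating the highlight annotation, remember the phrase it covers:
highlight.Put(phraseKey, new PdfString(highLightedText, PdfObject.TEXT_UNICODE));

// When removing, only match annotations that carry that exact phrase:
PdfString stored = annotationDic.GetAsString(phraseKey);
if (stored != null && stored.ToUnicodeString() == highLightedText)
{
    annots.Remove(i--);
}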
A warning on loop design
The OP wants to remove the annotations in this loop:
for (int i = 0; i < annots.Size; ++i)
{
    PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
    PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
    if (subType.Equals(PdfName.HIGHLIGHT))
    {
        PdfString str = annots.GetAsString(i);
        annots.Remove(i);
    }
}
The problem: if he removes the annotation at index i, the former (i+1)th annotation becomes the ith one. Since the loop then moves on to the new (i+1)th entry, that former (i+1)th annotation is never checked or removed.
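One simple fix is to walk the array backwards, so that removing an entry never shifts the indices that still have to be visited. A sketch:
for (int i = annots.Size - 1; i >= 0; --i)
{
    PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
    PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
    if (PdfName.HIGHLIGHT.Equals(subType))
    {
        // Removing index i only affects entries above i, which were already visited.
        annots.Remove(i);
    }
}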
Related
I have a problem extracting text from PDF documents using iText7. For documents coming from a specific source, textRenderInfo.GetText() returns only garbage chars (0xfdff) in the event handler of my extraction strategy:
internal class CustomExtractionStrategy : ITextExtractionStrategy
{
    public virtual void EventOccurred(IEventData data, EventType type)
    {
        if (!type.Equals((object)EventType.RENDER_TEXT))
        {
            return;
        }
        var textRenderInfo = (TextRenderInfo)data;
        bool currentResultEmpty = _result.Length == 0;
        bool isInNewLine = false;
        var baseline = textRenderInfo.GetBaseline();
        var startPoint = baseline.GetStartPoint();
        var endPoint = baseline.GetEndPoint();
        var currentText = textRenderInfo.GetText(); // returns garbage for specific pdfs
        // further processing below
        ...
    }
}
I'm not very familiar with the way text/glyph encoding works in PDF, but I'll try to give some details from comparing the problematic PDFs with an example where extraction works. For the PDFs with issues:
textRenderInfo.gs.font is MS-UIGothic
textRenderInfo.gs.font.fontProgram.codeToGlyph contains only mapping (key: 0 to a Glyph with width 1000, unicode -1, code 0)
textRenderInfo.gs.font.fontProgram.unicodeToGlyph contains no records
These are the most obvious discrepancies. If there's anything else I should look out for, please let me know. I would have provided an example of the PDF in question, but it might contain sensitive information that I must not disclose.
Note: the PDFs can be correctly read in Acrobat Reader and I can copy text from the reader into notepad. Other libraries (pdfium based or ports of PDFBox) can properly extract text from the document. So I think the document as such is "valid".
If this is a known issue for iText7, is there any workaround (other than using a different library altogether)?
Update
With the link provided in the comment and the following code (in addition to the custom extraction strategy snippet shown above), I get garbage chars (see the VS screenshot):
internal class PdfExtractor
{
    internal void ExtractFromPath(string path)
    {
        PdfReader reader = new PdfReader(path);
        var document = new iText.Kernel.Pdf.PdfDocument(reader);
        for (int pageNum = 1; pageNum <= document.GetNumberOfPages(); pageNum++)
        {
            var page = document.GetPage(pageNum);
            string text = PdfTextExtractor.GetTextFromPage(page, new CustomExtractionStrategy());
        }
    }
}
I am using iText7.NET. A third party has provided PDFs with fields; the fields are present, and Adobe Acrobat seems to have no issues opening and displaying the PDFs, but in iText the fields collection is empty.
I've seen the answer at ItextSharp - Acrofields are empty and the related knowledge-base articles on iText's site, but the fix does not work in my case, as form.getAsArray(PdfName.FIELDS) returns null, so it cannot be added to.
Also, I've checked for XFA, and that does not seem to be present:
XfaForm xfa = form.GetXfaForm();
xfa.IsXfaPresent() // returns false
Is it possible to add PdfName.FIELDS to the document and then populate?
Thank You
So I think I have figured out what causes the issue and have a short-term fix for my particular case. In this document some fields are of subtype "Link", not "Widget", and the fix code I was using (based on the link above, which most likely came from https://kb.itextsupport.com/home/it7kb/faq/why-are-the-acrofields-in-my-document-empty) will fail. My fix is to skip subtype Link, although perhaps a better solution exists that would not skip Links, which I don't need.
If I don't skip Links, when the saved PDF is loaded again it fails on
PdfAcroForm form = PdfAcroForm.GetAcroForm(pdfDoc, true);
In the lower-level code in itext.forms, IterateFields() is called, and within that formField.GetParent() is passed as a parameter to PdfFormField.MakeFormField; GetParent() returns null for the Link fields, so there is an exception.
Below is the RUPS hierarchy to the first subtype Link field that causes a problem
So the solution at the moment to fix my particular issue is to skip subtype Link annotations. The code is as follows:
PdfReader reader = new PdfReader(pdf);
MemoryStream dest = new MemoryStream();
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader, writer);
PdfCatalog root = pdfDoc.GetCatalog();
PdfDictionary form = root.GetPdfObject().GetAsDictionary(PdfName.AcroForm);
PdfArray fields = form.GetAsArray(PdfName.Fields);
if (fields == null)
{
    form.Put(PdfName.Fields, new PdfArray());
    fields = form.GetAsArray(PdfName.Fields);
}
for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    PdfPage page = pdfDoc.GetPage(i);
    var annots = page.GetAnnotations();
    for (int j = 0; j < annots.Count(); j++)
    {
        PdfObject o = annots[j].GetPdfObject();
        PdfDictionary m = o as PdfDictionary;
        string subType = m?.GetAsName(PdfName.Subtype)?.GetValue() ?? "";
        if (subType != "Link")
        {
            fields.Add(o);
            fields.SetModified();
        }
    }
}
pdfDoc.Close();
I am currently working on a program in which a user should be able to merge several Word documents into one, without losing any formatting, headers and so on. The documents should simply stack up, one after another, without any changes.
Here is my current code:
public virtual Byte[] MergeWordFiles(IEnumerable<SendData> sourceFiles)
{
    int f = 0;
    // If only one Word document then skip merge.
    if (sourceFiles.Count() == 1)
    {
        return sourceFiles.First().File;
    }
    else
    {
        MemoryStream destinationFile = new MemoryStream();
        // Add first file
        var firstFile = sourceFiles.First().File;
        destinationFile.Write(firstFile, 0, firstFile.Length);
        destinationFile.Position = 0;
        int pointer = 1;
        byte[] ret;
        // Add the rest of the files
        try
        {
            using (WordprocessingDocument mainDocument = WordprocessingDocument.Open(destinationFile, true))
            {
                XElement newBody = XElement.Parse(mainDocument.MainDocumentPart.Document.Body.OuterXml);
                for (pointer = 1; pointer < sourceFiles.Count(); pointer++)
                {
                    WordprocessingDocument tempDocument = WordprocessingDocument.Open(new MemoryStream(sourceFiles.ElementAt(pointer).File), true);
                    XElement tempBody = XElement.Parse(tempDocument.MainDocumentPart.Document.Body.OuterXml);
                    newBody.Add(XElement.Parse(new DocumentFormat.OpenXml.Wordprocessing.Paragraph(new Run(new Break { Type = BreakValues.Page })).OuterXml));
                    newBody.Add(tempBody);
                    mainDocument.MainDocumentPart.Document.Body = new Body(newBody.ToString());
                    mainDocument.MainDocumentPart.Document.Save();
                    mainDocument.Package.Flush();
                }
            }
        }
        catch (OpenXmlPackageException oxmle)
        {
            throw new Exception(string.Format(CultureInfo.CurrentCulture, "Error while merging files. Document index {0}", pointer), oxmle);
        }
        catch (Exception e)
        {
            throw new Exception(string.Format(CultureInfo.CurrentCulture, "Error while merging files. Document index {0}", pointer), e);
        }
        finally
        {
            ret = destinationFile.ToArray();
            destinationFile.Close();
            destinationFile.Dispose();
        }
        return ret;
    }
}
The problem here is that the formatting is copied from the first document and applied to all the rest, meaning that for instance a different header in the second document will be ignored. How do I prevent this?
I have been looking into breaking the document into sections using SectionMarkValues.NextPage, as well as using altChunk.
The problem with the latter is that altChunk does not seem to accept a MemoryStream in its FeedData method.
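For reference, the altChunk route usually takes roughly the following shape; FeedData is declared to take a Stream, so a MemoryStream over the source bytes (positioned at 0) is expected to work. This is only a sketch under those assumptions, and the method and variable names are illustrative:
// Sketch: append another document as an altChunk; Word merges the chunk content when the file is opened.
static void AppendViaAltChunk(WordprocessingDocument mainDocument, byte[] otherDocumentBytes, string altChunkId)
{
    MainDocumentPart mainPart = mainDocument.MainDocumentPart;
    AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
        AlternativeFormatImportPartType.WordprocessingML, altChunkId);
    using (MemoryStream ms = new MemoryStream(otherDocumentBytes))
    {
        chunk.FeedData(ms); // any readable Stream works here
    }
    AltChunk altChunk = new AltChunk { Id = altChunkId };
    mainPart.Document.Body.AppendChild(altChunk);
    mainPart.Document.Save();
}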
DocIO is a .NET library that can read, write, merge and render Word 2003/2007/2010/2013/2016 files. The whole suite of controls is available for free (commercial applications also) through the community license program if you qualify. The community license is the full product with no limitations or watermarks.
Step 1: Create a console application
Step 2: Add references to Syncfusion.DocIO.Base, Syncfusion.Compression.Base and Syncfusion.OfficeChart.Base. You can also add these references to your project using NuGet.
Step 3: Copy & paste the following code snippet.
This code snippet will produce the document as per your requirement; each input Word document will be merged with its original formatting, styles and headers/footers.
using Syncfusion.DocIO.DLS;
using Syncfusion.DocIO;
using System.IO;

namespace DocIO_MergeDocument
{
    class Program
    {
        static void Main(string[] args)
        {
            //Boolean to indicate whether any of the input documents has different odd and even headers enabled
            bool isDifferentOddAndEvenPagesEnabled = false;
            // Creating a new document.
            using (WordDocument mergedDocument = new WordDocument())
            {
                //Get the files from the input directory
                DirectoryInfo dirInfo = new DirectoryInfo(System.Environment.CurrentDirectory + @"\..\..\Data");
                FileInfo[] fileInfo = dirInfo.GetFiles();
                for (int i = 0; i < fileInfo.Length; i++)
                {
                    if (fileInfo[i].Extension == ".doc" || fileInfo[i].Extension == ".docx")
                    {
                        using (WordDocument sourceDocument = new WordDocument(fileInfo[i].FullName))
                        {
                            //Check whether the document has different odd and even header/footer
                            if (!isDifferentOddAndEvenPagesEnabled)
                            {
                                foreach (WSection section in sourceDocument.Sections)
                                {
                                    isDifferentOddAndEvenPagesEnabled = section.PageSetup.DifferentOddAndEvenPages;
                                    if (isDifferentOddAndEvenPagesEnabled)
                                        break;
                                }
                            }
                            //Sets the break code of the first section of the source document to NoBreak to avoid importing it on a new page
                            sourceDocument.Sections[0].BreakCode = SectionBreakCode.NoBreak;
                            //Imports the contents of the source document at the end of the merged document
                            mergedDocument.ImportContent(sourceDocument, ImportOptions.KeepSourceFormatting);
                        }
                    }
                }
                //If any of the input documents has different odd and even headers enabled, then
                //copy the content of the odd header/footer into the empty even header/footer
                if (isDifferentOddAndEvenPagesEnabled)
                {
                    foreach (WSection section in mergedDocument.Sections)
                    {
                        section.PageSetup.DifferentOddAndEvenPages = true;
                        if (section.HeadersFooters.OddHeader.Count > 0 && section.HeadersFooters.EvenHeader.Count == 0)
                        {
                            for (int i = 0; i < section.HeadersFooters.OddHeader.Count; i++)
                                section.HeadersFooters.EvenHeader.ChildEntities.Add(section.HeadersFooters.OddHeader.ChildEntities[i].Clone());
                        }
                        if (section.HeadersFooters.OddFooter.Count > 0 && section.HeadersFooters.EvenFooter.Count == 0)
                        {
                            for (int i = 0; i < section.HeadersFooters.OddFooter.Count; i++)
                                section.HeadersFooters.EvenFooter.ChildEntities.Add(section.HeadersFooters.OddFooter.ChildEntities[i].Clone());
                        }
                    }
                }
                //If there is no document to merge then add an empty section with an empty paragraph
                if (mergedDocument.Sections.Count == 0)
                    mergedDocument.EnsureMinimal();
                //Saves the document with the given name and format
                mergedDocument.Save("result.docx", FormatType.Docx);
            }
        }
    }
}
Downloadable Demo
Note: There is a Word document-level (not section-level) setting for applying different headers/footers for odd and even pages. Each input document can have a different value for this property. If any of the input documents has different odd and even headers/footers enabled, this affects the visual appearance of the headers/footers in the resultant document. Hence, if any of the input documents has different odd and even headers/footers, the empty even headers/footers of the resultant Word document are replaced with the odd header/footer contents.
For further information about DocIO, please refer to our help documentation.
Note: I work for Syncfusion
I am trying to add a sticky note reply in a PDF using iTextSharp. I am able to create a new annotation in the PDF, but I cannot link it as a child of an already existing annotation. I copied most of the properties of the parent to its child; I found them by manually adding a reply from Adobe Reader and analyzing the properties of that reply. What I am missing is the /IRT property, which needs a reference to the parent popup, like /IRT 16 0 R.
Below is the code I am trying:
private void annotateReplyPdf()
{
    string outputFile = @"D:\temp\temp.pdf";
    // Creating iTextSharp.text.pdf.PdfReader object to read the Existing PDF Document
    using (PdfReader reader = new PdfReader(FILE_NAME))
    {
        using (FileStream fs = new FileStream(outputFile, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            // Creating iTextSharp.text.pdf.PdfStamper object to write Data from iTextSharp.text.pdf.PdfReader object to FileStream object
            using (PdfStamper stamper = new PdfStamper(reader, fs))
            {
                //get page 1
                PdfDictionary pageDic = reader.GetPageN(1);
                //get annotations in page 1
                PdfArray pageAnnotsArray = pageDic.GetAsArray(PdfName.ANNOTS);
                if (pageAnnotsArray != null)
                {
                    PdfDictionary curAnnotDic = pageAnnotsArray.GetAsDict(0);
                    PdfArray rect = curAnnotDic.GetAsArray(PdfName.RECT);
                    Rectangle rectangle = new Rectangle(float.Parse(rect[0].ToString()), float.Parse(rect[1].ToString()), float.Parse(rect[2].ToString()), float.Parse(rect[3].ToString()));
                    PdfAnnotation newAnnot = new PdfAnnotation(stamper.Writer, rectangle);
                    newAnnot.Title = "john.conor";
                    var dtNow = DateTime.Now;
                    newAnnot.Put(PdfName.C, curAnnotDic.Get(PdfName.C));
                    newAnnot.Put(PdfName.CONTENTS, new PdfString("Reply using prog"));
                    newAnnot.Put(PdfName.CREATIONDATE, new PdfDate(dtNow));
                    // newAnnot.Put(PdfName.IRT, curAnnotDic.); stuck here
                    newAnnot.Put(PdfName.M, new PdfDate(dtNow));
                    newAnnot.Put(PdfName.NAME, curAnnotDic.Get(PdfName.NAME));
                    newAnnot.Put(PdfName.RC, curAnnotDic.Get(PdfName.RC));
                    newAnnot.Put(PdfName.SUBTYPE, PdfName.TEXT);
                    newAnnot.Put(PdfName.SUBJECT, curAnnotDic.Get(PdfName.SUBJECT));
                    stamper.AddAnnotation(newAnnot, 1);
                }
            }
        }
    }
}
The methods I have used might not be accurate or efficient, as most of the code was found by trial and error, by checking other similar examples (and by checking the PDF specification).
Can somebody please fill in the code that does the magic?
Note: the SO question doesn't provide code for the answer.
Please take a look at the AddInReplyTo example.
We have a file named hello_sticky_note.pdf that looks like this:
I am going to skip the method to detect the annotation of the sticky note (in your question, you already have this code). In my example, I know that this annotation is the first entry in the /Annots array (the annotation with index 0).
This is how I'm going to add an "in reply to" annotation:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary page = reader.getPageN(1);
    PdfArray annots = page.getAsArray(PdfName.ANNOTS);
    PdfDictionary sticky = annots.getAsDict(0);
    PdfArray stickyRect = sticky.getAsArray(PdfName.RECT);
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    PdfWriter writer = stamper.getWriter();
    Rectangle stickyRectangle = new Rectangle(
        stickyRect.getAsNumber(0).floatValue(), stickyRect.getAsNumber(1).floatValue(),
        stickyRect.getAsNumber(2).floatValue(), stickyRect.getAsNumber(3).floatValue()
    );
    PdfAnnotation replySticky = PdfAnnotation.createText(
        writer, stickyRectangle, "Reply", "Hello PDF", true, "Comment");
    replySticky.put(PdfName.IRT, annots.getAsIndirectObject(0));
    stamper.addAnnotation(replySticky, 1);
    stamper.close();
}
Just like you, I get the original annotation (in my code, it's named sticky) and I get the position of that annotation (stickyRect). I create a stickyRectangle object in a slightly different way than you do (my way is better, but that doesn't matter too much) and I use that stickyRectangle to create a new PdfAnnotation named replySticky.
That's what you already have. Now I add the missing part:
replySticky.Put(PdfName.IRT, annots.GetAsIndirectObject(0));
In your code, you add the annotation dictionary, but what you actually need is the reference to that dictionary.
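Mapped onto the variable names in the question's snippet, the commented-out line would presumably become something like:
// pageAnnotsArray is the /Annots array read earlier; index 0 is the annotation being replied to.
newAnnot.Put(PdfName.IRT, pageAnnotsArray.GetAsIndirectObject(0));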
The resulting PDF looks like hello_in_reply_to.pdf:
I have a large number of PDF documents with XML files attached to them. I would like to extract those attached XML files and read them. How can I do this programmatically using .NET?
iTextSharp is also quite capable of extracting attachments... Though you might have to use the low level objects to do so.
There are two ways to embed files in a PDF:
In a File Annotation
At the document level "EmbeddedFiles".
Once you have a file specification dictionary from either source, the file itself will be a stream within the dictionary labeled "EF" (embedded file).
So to list all the files at the document level, one would write code (in Java) as such:
Map<String, byte[]> files = new HashMap<String, byte[]>();
PdfReader reader = new PdfReader(pdfPath);
PdfDictionary root = reader.getCatalog();
PdfDictionary names = root.getAsDict(PdfName.NAMES); // may be null
PdfDictionary embeddedFilesDict = names.getAsDict(PdfName.EMBEDDEDFILES); // may be null
PdfArray embeddedFiles = embeddedFilesDict.getAsArray(PdfName.NAMES); // may be null
int len = embeddedFiles.size();
for (int i = 0; i < len; i += 2) {
    PdfString name = embeddedFiles.getAsString(i); // should always be present
    PdfDictionary fileSpec = embeddedFiles.getAsDict(i + 1); // ditto
    PdfDictionary streams = fileSpec.getAsDict(PdfName.EF);
    PRStream stream = null;
    if (streams.contains(PdfName.UF))
        stream = (PRStream) streams.getAsStream(PdfName.UF);
    else
        stream = (PRStream) streams.getAsStream(PdfName.F); // default stream for backwards compatibility
    if (stream != null) {
        files.put(name.toUnicodeString(), PdfReader.getStreamBytes((PRStream) stream));
    }
}
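For the first case (file attachment annotations at page level), a similar walk over each page's /Annots array might look like this in C#/iTextSharp; this is a sketch along the same lines rather than tested production code:
// Collect embedded files referenced by FileAttachment annotations on each page.
Dictionary<string, byte[]> files = new Dictionary<string, byte[]>();
PdfReader reader = new PdfReader(pdfPath);
for (int p = 1; p <= reader.NumberOfPages; p++)
{
    PdfArray annots = reader.GetPageN(p).GetAsArray(PdfName.ANNOTS);
    if (annots == null) continue;
    for (int i = 0; i < annots.Size; i++)
    {
        PdfDictionary annot = annots.GetAsDict(i);
        if (annot == null || !PdfName.FILEATTACHMENT.Equals(annot.GetAsName(PdfName.SUBTYPE))) continue;
        PdfDictionary fileSpec = annot.GetAsDict(PdfName.FS); // the file specification dictionary
        PdfDictionary ef = fileSpec == null ? null : fileSpec.GetAsDict(PdfName.EF);
        if (ef == null) continue;
        PRStream stream = (PRStream)ef.GetAsStream(PdfName.F); // or PdfName.UF, as above
        PdfString fileName = fileSpec.GetAsString(PdfName.F) ?? fileSpec.GetAsString(PdfName.UF);
        if (stream != null && fileName != null)
        {
            files[fileName.ToUnicodeString()] = PdfReader.GetStreamBytes(stream);
        }
    }
}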
This is an old question; nonetheless, I think my alternative solution (using PDF Clown) may be of some interest, as it's much cleaner (and more complete, since it iterates both at document and page level) than the code fragments previously proposed:
using org.pdfclown.bytes;
using org.pdfclown.documents;
using org.pdfclown.documents.files;
using org.pdfclown.documents.interaction.annotations;
using org.pdfclown.objects;
using System;
using System.Collections.Generic;

void ExtractAttachments(string pdfPath)
{
    Dictionary<string, byte[]> attachments = new Dictionary<string, byte[]>();
    using (org.pdfclown.files.File file = new org.pdfclown.files.File(pdfPath))
    {
        Document document = file.Document;
        // 1. Embedded files (document level).
        foreach (KeyValuePair<PdfString, FileSpecification> entry in document.Names.EmbeddedFiles)
        { EvaluateDataFile(attachments, entry.Value); }
        // 2. File attachments (page level).
        foreach (Page page in document.Pages)
        {
            foreach (Annotation annotation in page.Annotations)
            {
                if (annotation is FileAttachment)
                { EvaluateDataFile(attachments, ((FileAttachment)annotation).DataFile); }
            }
        }
    }
}

void EvaluateDataFile(Dictionary<string, byte[]> attachments, FileSpecification dataFile)
{
    if (dataFile is FullFileSpecification)
    {
        EmbeddedFile embeddedFile = ((FullFileSpecification)dataFile).EmbeddedFile;
        if (embeddedFile != null)
        { attachments[dataFile.Path] = embeddedFile.Data.ToByteArray(); }
    }
}
Note that you don't have to bother with null pointer exceptions as PDF Clown provides all the necessary abstraction and automation to ensure smooth model traversal.
PDF Clown is an LGPL 3 library, implemented on both the Java and .NET platforms (I'm its lead developer): if you want to give it a try, I suggest you check out its SVN repository on sourceforge.net as it keeps evolving.
Look at the ABCpdf library; it's very easy and fast in my opinion.
What I got working is slightly different from anything else I have seen online.
So, just in case, I thought I would post this here to help someone else. I had to go through many different iterations to figure out - the hard way - what I needed to get it to work.
I am merging two PDFs into a third PDF, where one of the first two PDFs may have file attachments that need to be carried over into the third PDF. I am working completely in streams with ASP.NET, C# 4.0, ITextSharp 5.1.2.0.
// Extract Files from Submit PDF
Dictionary<string, byte[]> files = new Dictionary<string, byte[]>();
PdfDictionary names;
PdfDictionary embeddedFiles;
PdfArray fileSpecs;
int eFLength = 0;
names = writeReader.Catalog.GetAsDict(PdfName.NAMES); // may be null, writeReader is the PdfReader for a PDF input stream
if (names != null)
{
    embeddedFiles = names.GetAsDict(PdfName.EMBEDDEDFILES); // may be null
    if (embeddedFiles != null)
    {
        fileSpecs = embeddedFiles.GetAsArray(PdfName.NAMES); // may be null
        if (fileSpecs != null)
        {
            eFLength = fileSpecs.Size;
            for (int i = 0; i < eFLength; i++)
            {
                i++; // objects are in pairs and we only want the odd indices (1, 3, 5, ...)
                PdfDictionary fileSpec = fileSpecs.GetAsDict(i); // may be null
                if (fileSpec != null)
                {
                    PdfDictionary refs = fileSpec.GetAsDict(PdfName.EF);
                    foreach (PdfName key in refs.Keys)
                    {
                        PRStream stream = (PRStream)PdfReader.GetPdfObject(refs.GetAsIndirectObject(key));
                        if (stream != null)
                        {
                            files.Add(fileSpec.GetAsString(key).ToString(), PdfReader.GetStreamBytes(stream));
                        }
                    }
                }
            }
        }
    }
}
You may try Aspose.Pdf.Kit for .NET. The PdfExtractor class allows you to extract attachments with the help of two methods: ExtractAttachment and GetAttachment. Please see an example of attachment extraction.
Disclosure: I work as developer evangelist at Aspose.