Update: 2021-01-15 - Added Bounty
I am trying to alter the redaction annotation to change the underlying text that gets burned into a PDF when you apply redactions. In Acrobat, you can set up a collection of "redaction codes" that can be used to identify why you are marking something as redacted. My goal is to overwrite what was selected by the user with a system-defined value. The code will be run before the redactions are applied.
In my attempts, I have discovered that the "preview" that is available in Acrobat products when hovering your cursor over a redact box is unique to Acrobat, and most other viewers won't show the preview. It also seems like the preview is maintained separately from the actual redaction that is applied. I don't need to alter the text that is shown in the preview, just what is shown after redactions are applied.
I have added a bounty of 150 reputation, as I don't think that I will be able to work out a solution on my own. My original question specified iText7, as that was the library that got me the closest in my own attempts. While I would prefer to use iText7, I will also consider solutions using other libraries that I can reasonably access (I do have a small budget that I could use to purchase another library, if I need to).
I've kept my original question and the follow-up with what I've personally tried below. I appreciate any help offered.
If you need a sample to test with, this DropBox folder has a file called 01 - Original.pdf that you can use as the source document. The desired result is to be able to change the text that appears when applying redactions from "Original Overlay Text" to any other value, such as "New Text".
Original Question:
I am trying to alter the text contained within every redaction annotation in a PDF, using iText7. The PdfRedactAnnotation object has a method called SetOverlayText() that looks like it should do what I want. So, I wrote a method that opens a PDF, loops through the pages, then loops through the annotations on each page, and checks if an annotation is a PdfRedactAnnotation. If it is, it calls SetOverlayText().
When debugging and looking at the annotation properties, I can see that the OverlayText has definitely changed. When I open the file and check the overlay text by hovering over a redaction marking with my cursor, however, the original overlay text is still there.
Additionally, if I apply the redactions, the original overlay text is what gets burned into the page.
However, when I right-click on the annotation (before applying redactions), the overlay text immediately gets updated to the new text:
At this point, when I apply redactions, it's the new text that is burned into the PDF.
Is there any way that I can trigger the Redaction Annotation update programmatically, without having to open and right-click on every one? I've included my code below. Thank you for any advice anyone might be able to offer.
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Annot;
using iText.Layout;

PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"C:\temp\Test - Original.pdf"), new PdfWriter(@"C:\temp\Test - Output.pdf"));
Document doc = new Document(pdfDoc);
int pageCount = pdfDoc.GetNumberOfPages();
// iText page numbers are 1-based
for (int i = 1; i <= pageCount; i++)
{
    var annotations = pdfDoc.GetPage(i).GetAnnotations();
    foreach (var annotation in annotations)
    {
        // Only redaction annotations carry an OverlayText entry
        if (annotation is PdfRedactAnnotation)
        {
            PdfRedactAnnotation redact = (PdfRedactAnnotation)annotation;
            redact.SetOverlayText(new PdfString("New Text"));
        }
    }
}
// Closing the layout document also closes pdfDoc and writes the output file
doc.Close();
Update: Findings as of 2021-01-07
As @mkl's answer points out, the PDF Redact Annotation Specification clarifies the underlying redact annotation DOM entries. OverlayText is just one part of the equation. If you use OverlayText then there must be a DA element defined (DA is a string that provides formatting info for the OverlayText). Finally, if RO is defined, it supersedes pretty much all of the other independent display entries.
My testing document was made using Acrobat DC Pro, by manually adding a redaction in Acrobat. Doing this results in a Redact annotation with all of the above entries set. Copies of my test documents can be found in this DropBox folder.
(Side note: In my original question, I mention hovering over the redaction's red rectangle in order to preview what the applied redaction will look like... After testing in multiple browsers and other PDF Viewers like Foxit Reader, it looks like the function to 'preview' what the redaction will look like when applied by hovering your mouse over the red outline is only supported in Acrobat products. All other viewers tested will only show the red border, with nothing occurring when you hover your cursor over it. The black rectangles shown above can only be viewed in other programs after redactions have been applied.
Additional testing has shown that the hover-over preview is maintained separately from the redaction details itself, with Acrobat operating to try to keep the hover-over details in-sync with the underlying annotation. It is best to ignore the hover-over preview when testing, and refer to the results after applying redactions.)
@mkl's recommendation to remove the RO entry in order to try to let the OverlayText take priority was a good idea, but it unfortunately didn't work. There was no notable difference from my original results.
After poking around in iText7's PdfRedactAnnotation, I found that the following methods all result in a reference to the Redact object's RO entry:
PdfRedactAnnotation redact = (PdfRedactAnnotation)annotation;
redact.GetRolloverAppearanceObject();
redact.GetRedactRolloverAppearance();
redact.GetPdfObject().Get(PdfName.RO);
redact.GetAppearanceDictionary().Get(PdfName.R);
(I confirmed they are in fact the exact same reference by checking the equality comparator. As reference types, they all returned true when tested using ==).
On further testing, I have concluded that the RO property must have a copy of the same OverlayText stored internally. If you have two redactions with different original values, you can "copy" the RO element from one redaction to another:
PdfObject ro = firstRedact.GetPdfObject().Get(PdfName.RO);
secondRedact.GetPdfObject().Put(PdfName.RO, ro);
If you do this and apply redactions, the "overlay text" from the first redact will have replaced the "overlay text" in the second. The other RO element values are also copied (such as BBox, which defines the black rectangle's dimensions)... but at least those elements can be adjusted.
The problem remains that the iText7 PdfObject of RO has 7 sub elements, and none of them or their descendant elements appear to expose the text that I'm trying to change.
My final test was whether I could copy RO elements from one PDF to another (so that I could use a second source PDF with an annotation with the desired RO "overlay text" already configured), but it looks like indirect objects don't like being .Put() into other documents.
So now, I'm left with trying to either find a way to access/alter the text stored away in RO, or to clone a preconfigured RO from another document.
What does the specification say?
The OverlayText entry of redaction annotations is specified as
Key: OverlayText
Type: text string
Value: (Optional) A text string specifying the overlay text that should be drawn over the redacted region after the affected content has been removed. This entry is ignored if the RO entry is present.
(ISO 32000-2, Table 195 — Additional entries specific to a redaction annotation)
Maybe in your source PDF the redaction annotation has a RO taking precedence.
Furthermore, that table says this concerning the DA entry:
Key: DA
Type: byte string
Value: (Required if OverlayText is present, ignored otherwise) The appearance string that shall be used in formatting the overlay text when it is drawn after the affected content has been removed (see 12.7.4.3, "Variable text"). This entry is ignored if the RO entry is present.
If you use OverlayText, therefore, you also have to make sure the DA default appearance string is set. Did you?
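For illustration, a DA string is a short fragment of content-stream operators: typically a font selection (Tf) followed by a fill color. The sketch below assembles one in plain Java without any PDF library. The class name, the method names, and the /Helv font resource name are assumptions made for this example; the font name has to match a font resource the viewer can actually resolve.

```java
import java.math.BigDecimal;

public class DefaultAppearance {

    /** Builds a DA string such as "/Helv 10 Tf 1 1 1 rg" (font, size, RGB fill color). */
    static String build(String fontResourceName, double fontSize, double r, double g, double b) {
        return "/" + fontResourceName + " " + num(fontSize) + " Tf "
                + num(r) + " " + num(g) + " " + num(b) + " rg";
    }

    /** Formats a number without exponent notation or trailing zeros, e.g. 10.0 becomes "10". */
    static String num(double value) {
        return new BigDecimal(String.valueOf(value)).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        // White 10pt text in the (assumed) Helv font resource
        System.out.println(build("Helv", 10, 1, 1, 1));
    }
}
```

The resulting string would then be stored as the annotation's DA entry next to OverlayText. Keep in mind that, per the table, both entries are ignored as long as an RO entry is present.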
The RO entry in the same table is specified as
Key: RO
Type: stream
Value: (Optional) A form XObject specifying the overlay appearance for this redaction annotation. After this redaction is applied and the affected content has been removed, the overlay appearance should be drawn such that its origin lines up with the lower-left corner of the annotation rectangle. This form XObject is not necessarily related to other annotation appearances, and may or may not be present in the AP dictionary. This entry takes precedence over the IC, OverlayText, DA, and Q entries.
So what to do now?
According to the details posted above, one obvious option to proceed is to create a redaction overlay XObject (RO) for the changed redaction annotations. You can do this by replacing your
if (annotation is PdfRedactAnnotation)
{
    PdfRedactAnnotation redact = (PdfRedactAnnotation)annotation;
    redact.SetOverlayText(new PdfString("New Text"));
}
by
if (annotation is PdfRedactAnnotation)
{
    PdfRedactAnnotation redact = (PdfRedactAnnotation)annotation;
    redact.SetOverlayText(new PdfString("New Text"));

    // Use the BBox of the existing RO stream if there is one, else the annotation rectangle
    Rectangle rectangle = redact.GetRectangle().ToRectangle();
    PdfStream stream = redact.GetRedactRolloverAppearance();
    if (stream != null)
    {
        rectangle = stream.GetAsArray(PdfName.BBox).ToRectangle();
    }

    // New overlay form XObject; the Matrix moves the BBox's lower-left corner to the origin
    PdfFormXObject redactionOverlay = new PdfFormXObject(rectangle);
    redactionOverlay.GetPdfObject().Put(PdfName.Matrix, new PdfArray(new double[] { 1, 0, 0, 1, -rectangle.GetX(), -rectangle.GetY() }));

    // pdfDoc is the PdfDocument from the question's code
    using (Canvas canvas = new Canvas(redactionOverlay, pdfDoc))
    {
        PdfCanvas pdfCanvas = canvas.GetPdfCanvas();
        // Black background rectangle
        pdfCanvas.SetFillColorGray(0);
        pdfCanvas.Rectangle(rectangle);
        pdfCanvas.Fill();
        // White overlay text
        pdfCanvas.SetFillColorGray(1);
        canvas.Add(new Paragraph("New Text"));
    }

    // Use the same stream for the rollover and down appearances and for RO
    stream = redactionOverlay.GetPdfObject();
    redact.SetRolloverAppearance(stream);
    redact.SetDownAppearance(stream);
    redact.SetRedactRolloverAppearance(stream);
}
The result after redacting in Acrobat:
By adapting the used fill colors and the paragraph style you can make the appearance correspond more closely to the Adobe Acrobat generated appearances (or you alternatively can generate a look completely of your own design).
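The Matrix entry in the code above is a pure translation: it maps the lower-left corner of the form's BBox to the origin, which is where the specification says the overlay should line up with the annotation rectangle. A minimal sketch of that arithmetic in plain Java (no PDF library; the class, the method names, and the sample coordinates are made up for illustration):

```java
public class OverlayMatrix {

    /** Returns the PDF matrix [a b c d e f] that translates (llx, lly) to the origin. */
    static double[] translationToOrigin(double llx, double lly) {
        return new double[] { 1, 0, 0, 1, -llx, -lly };
    }

    /** Applies a PDF matrix to a point: x' = a*x + c*y + e, y' = b*x + d*y + f. */
    static double[] apply(double[] m, double x, double y) {
        return new double[] { m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5] };
    }

    public static void main(String[] args) {
        // Hypothetical BBox with lower-left (100, 650) and upper-right (300, 670)
        double[] matrix = translationToOrigin(100, 650);
        double[] lowerLeft = apply(matrix, 100, 650);
        double[] upperRight = apply(matrix, 300, 670);
        System.out.println(lowerLeft[0] + ", " + lowerLeft[1]);
        System.out.println(upperRight[0] + ", " + upperRight[1]);
    }
}
```

With this matrix the content can be drawn using the same page coordinates as the BBox, and the form still ends up anchored at the lower-left corner of the annotation rectangle.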
Beware: I only have a fairly old Adobe Acrobat version available (v9.5), so current versions may not accept a redaction appearance generated as above, or may apply it differently.
I was able to change the redaction annotation overlay text and, upon redaction, have that text display correctly over the redacted block. I used the SyncFusion Essential PDF library that is included as a part of SyncFusion File Formats. (I am not affiliated with SyncFusion, though I do have a paid license to their File Formats libraries through my employer.) I tested with Adobe Acrobat Pro DC.
When I first attempted to replace the redaction overlay text, I ran into a similar issue with SyncFusion as the OP did with iText 7: the overlay would display as changed after running my code, but redaction would bring back the formerly replaced overlay text. As there was no way to change both the displayed text overlay and the overlay text accessible by the redaction process, I got around this issue by writing code that makes the desired changes, exports the PDF's annotations to a JSON file, deletes the PDF's annotations, and then imports the JSON file back into the PDF. This generates new annotations that have the same text value for both the text overlay and the redaction process (the redaction process overlay text, I believe, is generated as a result of the creation of the PDF annotation). This is the code using SyncFusion Essential PDF:
using System.Drawing;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf.Interactive;
using Syncfusion.Pdf.Parsing;

PdfLoadedDocument loadedDocument = new PdfLoadedDocument(@"C:\Users\Joe\Desktop\Redact\MarkedOriginal.pdf");
PdfLoadedPage page = loadedDocument.Pages[0] as PdfLoadedPage;
// Assumes every annotation on the first page is a redaction annotation
foreach (PdfLoadedRedactionAnnotation redactionAnnotation in loadedDocument.Pages[0].Annotations)
{
    PdfStandardFont font = new PdfStandardFont(PdfFontFamily.Helvetica, 10);
    redactionAnnotation.Font = font;
    redactionAnnotation.TextColor = Color.White;
    redactionAnnotation.BorderColor = Color.Black; //See note in SO answer about this
    redactionAnnotation.OverlayText = "New Text";
}
//Export, delete, and then import annotations to create a redaction annotation with the same preview and final redaction
loadedDocument.ExportAnnotations(@"C:\Users\Joe\Desktop\Redact\Output.json", AnnotationDataFormat.Json);
//Remove from the end so the remaining indexes stay valid and index 0 is removed as well
for (int i = loadedDocument.Pages[0].Annotations.Count - 1; i >= 0; i--)
{
    loadedDocument.Pages[0].Annotations.RemoveAt(i);
}
loadedDocument.ImportAnnotations(@"C:\Users\Joe\Desktop\Redact\Output.json", AnnotationDataFormat.Json);
loadedDocument.Save();
loadedDocument.Close(true);
If the OP needs the border of the redaction marking boxes to be a color other than black, some more code will need to be written. When I used redactionAnnotation.BorderColor = Color.Black; the redaction marking box looked as expected. However, when I used Color.Red or other colors with the file supplied by the OP, the first redaction ended up bordered by both black and the new color, while the second redaction was bordered only by black. With further research, I suspect this can be remediated via SyncFusion, iText 7, or possibly by editing the JSON file's annotation defaultappearance line prior to importing the file back into the PDF. This is the defaultappearance line generated when I ran my code:
"defaultappearance": "1 1 1 RG 0 g 0 Tc 0 Tw 100 Tz 0 TL 0 Ts 0 Tr /Helv 10 Tf"
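If editing that line turns out to be the way to go, the change itself is plain string manipulation. The following is only a sketch in Java with no PDF or JSON library: it swaps the three operands in front of the RG (stroke color) operator. I have not verified that RG is what controls the border after import, and the class and method names are made up for this example.

```java
import java.util.regex.Matcher;

public class DefaultAppearanceEdit {

    /** Replaces the operands of the first "r g b RG" group with the given RGB operands. */
    static String withStrokeColor(String defaultAppearance, String rgbOperands) {
        // Three numbers followed by the uppercase RG operator (lowercase rg is the fill color)
        String strokeColor = "(?:[-+]?[0-9]*\\.?[0-9]+\\s+){3}RG\\b";
        return defaultAppearance.replaceFirst(strokeColor,
                Matcher.quoteReplacement(rgbOperands + " RG"));
    }

    public static void main(String[] args) {
        String original = "1 1 1 RG 0 g 0 Tc 0 Tw 100 Tz 0 TL 0 Ts 0 Tr /Helv 10 Tf";
        // Red stroke color instead of white
        System.out.println(withStrokeColor(original, "1 0 0"));
    }
}
```

The edited string would then have to be written back to the exported JSON file before calling ImportAnnotations.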
It's worth pointing out that SyncFusion has free and paid tiers for licensing their software. The SyncFusion Community License is, per SyncFusion, free for "companies and individuals with less than $1 million USD in annual gross revenue and 5 or fewer developers." The SyncFusion File Formats Developer License would cover everyone else.
Based on a condition, I need to hide one section, and the section below it should move up, so that the hidden section does not show as a blank area in the generated PDF.
Some clarification:
If you are doing this with an existing PDF, it is not likely to work. PDF documents are not a WYSIWYG format. Think of them more as containers of drawing instructions than as containers of text.
Moving a section of an existing document will not work because:
the document itself contains no information on what instructions go together to make up lines, paragraphs, and sections
the document uses compression and byte offsets, so moving or deleting part of it means that you need to recalculate all the byte offsets
If you drop the requirement of re-flowing the text, it is certainly possible. iText already has an add-on for that called pdfSweep, which looks at all the drawing and rendering operations and removes the ones that intersect with a given rectangle (or adjusts them, for instance when a path goes through the rectangle).
If you are generating the pdf, this is of course trivial. You can simply do something like:
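import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Paragraph;
import java.io.File;

File outputFile = new File(System.getProperty("user.home"), "output.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfWriter(outputFile));
Document layoutDocument = new Document(pdfDocument);
// some_condition is a placeholder: only add the optional section when it holds
if (some_condition)
{
    layoutDocument.add(new Paragraph("Lorem Ipsum Dolor Sit Amet"));
}
layoutDocument.add(new Paragraph("Never gonna give you up. Never gonna let you down."));
layoutDocument.close(); // without this the PDF is never written out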
Check out http://itextpdf.com/itext7/pdfsweep
I need to extract text from a pdf document and I am using the iTextSharp library to do so. The issue is that the image has text on it which is not part of the image. I have been looking to find a way to get the coordinates of the image as there are annotations on the image being included in the text extraction:
e.g.
Results in the extraction:
Some text...
Text
Text
More Text..
However, the text in the image is irrelevant and needs to be ignored to give the output:
Some text...
More text...
Another issue is that this occurs on multiple pages, and the images are all different sizes. However, the unwanted text is always within the image bounds, so if I have a way of determining the height and the x, y coordinates of each image relative to the page, I can extract the necessary data.
Currently, I need to just get the text but I will also need to extract the images at a later date.
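To make the idea concrete: once the image rectangles and the position of each extracted text chunk are known, the filtering itself is a simple containment test. The sketch below is plain Java and not iTextSharp code; the Rect and TextChunk types, their fields, and the sample coordinates are hypothetical stand-ins for whatever the extraction step would deliver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ImageTextFilter {

    /** Axis-aligned rectangle in page coordinates (hypothetical helper type). */
    static class Rect {
        final double x, y, width, height;
        Rect(double x, double y, double width, double height) {
            this.x = x; this.y = y; this.width = width; this.height = height;
        }
        boolean contains(double px, double py) {
            return px >= x && px <= x + width && py >= y && py <= y + height;
        }
    }

    /** A piece of extracted text with the position of its start point (hypothetical). */
    static class TextChunk {
        final String text; final double x, y;
        TextChunk(String text, double x, double y) { this.text = text; this.x = x; this.y = y; }
    }

    /** Keeps only the chunks whose start point lies outside every image rectangle. */
    static List<String> textOutside(List<TextChunk> chunks, List<Rect> images) {
        List<String> kept = new ArrayList<>();
        for (TextChunk chunk : chunks) {
            boolean insideImage = false;
            for (Rect image : images) {
                if (image.contains(chunk.x, chunk.y)) { insideImage = true; break; }
            }
            if (!insideImage) kept.add(chunk.text);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Rect> images = Arrays.asList(new Rect(72, 400, 300, 200));
        List<TextChunk> chunks = Arrays.asList(
                new TextChunk("Some text...", 72, 700),
                new TextChunk("Text", 100, 500),
                new TextChunk("Text", 200, 450),
                new TextChunk("More text...", 72, 300));
        System.out.println(textOutside(chunks, images));
    }
}
```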
I want to add bookmarks to my PDF using MigraDoc.
For example: two images on a single page.
1. Images1
2. Images2
Bookmarks with the same names should be generated.
If I click on the Images1 bookmark, that image should be shown to me. Remember that both images are on a single page.
MigraDoc creates bookmarks automatically for headings.
To create bookmarks without visible text on the page, you can create headings with a font size of e.g. 0.0001 and white colour.
There is one drawback: up to PDFsharp 1.50 beta 1, these bookmarks jump to the correct page, but not the correct area on the page. So with two images on one page, the bookmarks will not work as intended by the OP.
I want to copy certain elements from one PDF to another using iTextSharp.
I want to read one PDF, read text elements from that and correct them and create a new PDF using the updated text elements and all the images etc. from the first PDF.
How can this be achieved?
This task is very complex. I wrote a program to do this for a large greeting card maker.
First you have to locate the text and calculate the glyph bounding boxes. Next you have to modify the contents stream to remove the text. The text may be broken into many pieces depending on the PDF creator. You have to remove those operators from the contents stream and adjust the CTM because some operators use relative positioning. Finally, you have to insert the replacement text, matching the original text's style (font, size, color, orientation, etc.)
As for copying elements from one PDF to another, most of the steps above are required, plus you have to copy resources (e.g. fonts, colorspaces, patterns) to the new PDF.