Remove Javascript from PDF using iTextSharp - c#

This seems like something that should be quick to do, but in practice there seems to be a problem. I have a bunch of PDF forms that include form fields and embedded javascript. I would like to remove the javascript code safely, but leave the PDF form fields intact.
So far I've been able to find lots of solutions, but all the solutions have either eliminated both the javascript and the form fields, or left both intact.
Here's solution A; it copies both form fields and javascript:
var pdfReader = new PdfReader(infilename);
using (MemoryStream memoryStream = new MemoryStream()) {
PdfCopyFields copy = new PdfCopyFields(memoryStream);
copy.AddDocument(pdfReader);
copy.Close();
File.WriteAllBytes(rawfilename, memoryStream.ToArray());
}
Alternately, I have solution B, that strips out both form fields and javascript:
Document document = new Document();
using (MemoryStream memoryStream = new MemoryStream()) {
PdfWriter writer = PdfWriter.GetInstance(document, memoryStream);
document.Open();
document.AddDocListener(writer);
for (int p = 1; p <= pdfReader.NumberOfPages; p++) {
document.SetPageSize(pdfReader.GetPageSize(p));
document.NewPage();
PdfContentByte cb = writer.DirectContent;
PdfImportedPage pageImport = writer.GetImportedPage(pdfReader, p);
int rot = pdfReader.GetPageRotation(p);
if (rot == 90 || rot == 270) {
cb.AddTemplate(pageImport, 0, -1.0F, 1.0F, 0, 0, pdfReader.GetPageSizeWithRotation(p).Height);
} else {
cb.AddTemplate(pageImport, 1.0F, 0, 0, 1.0F, 0, 0);
}
}
document.Close();
File.WriteAllBytes(rawfile, memoryStream.ToArray());
}
Does anyone know how to modify either solution A or B to eliminate the javascript but leave the form fields in place?
EDIT: Solution code is here!
using (MemoryStream memoryStream = new MemoryStream()) {
PdfStamper stamper = new PdfStamper(pdfReader, memoryStream);
for (int i = 0; i <= pdfReader.XrefSize; i++) {
object o = pdfReader.GetPdfObject(i);
PdfDictionary pd = o as PdfDictionary;
if (pd != null) {
pd.Remove(PdfName.AA);
pd.Remove(PdfName.JS);
pd.Remove(PdfName.JAVASCRIPT);
}
}
stamper.Close();
pdfReader.Close();
File.WriteAllBytes(rawfile, memoryStream.ToArray());
}

To manipulate a single PDF you should use the class PdfStamper and manipulate its contents, in your case iterating over the existing form fields and removing the JavaScript entries.
The iTextSharp sample AddJavaScriptToForm.cs corresponding to AddJavaScriptToForm.java from chapter 13 of iText in Action — 2nd Edition shows how JavaScript actions are added to fields, the central code being:
PdfStamper stamper = new PdfStamper(reader, ms);
AcroFields form = stamper.AcroFields;
AcroFields.Item fd = form.GetFieldItem("married");
PdfDictionary dictYes = (PdfDictionary) PdfReader.GetPdfObject(fd.GetWidgetRef(0));
PdfDictionary yesAction = ...;
dictYes.Put(PdfName.AA, yesAction);
Thus, to remove such JavaScript form field actions you have to iterate over all those PDF form fields and remove the /AA values in the associated dictionaries:
dictXXX.Remove(PdfName.AA);
EDIT: (provided by Ted Spence) Here is the final code that successfully removes javascript while leaving all form fields intact:
using (MemoryStream memoryStream = new MemoryStream())
{
PdfStamper stamper = new PdfStamper(pdfReader, memoryStream);
for (int i = 0; i <= pdfReader.XrefSize; i++)
{
PdfDictionary pd = pdfReader.GetPdfObject(i) as PdfDictionary;
if (pd != null)
{
pd.Remove(PdfName.AA); // Removes automatic execution objects
pd.Remove(PdfName.JS); // Removes javascript objects
pd.Remove(PdfName.JAVASCRIPT); // Removes other javascript objects
}
}
stamper.Close();
pdfReader.Close();
File.WriteAllBytes(rawfile, memoryStream.ToArray());
}
EDIT: (by mkl) The solution above is somewhat overachieving because it touches each and every indirect dictionary object. On the other hand it ignores inline dictionaries (I haven't checked the spec, though; maybe all /AA, /JS, and /JAVASCRIPT entries appear only in dictionaries which have to be indirect objects, or at least are de-referenced by this code).
If fulfilling this task was my job, I would try and access the objects possibly carrying JavaScript more specifically.
The advantage of this overachieving procedure might be, though, that even PDF objects are inspected which currently are not specified as carrying JavaScript but will be in later PDF versions.

Add the following lines after the for loop to keep the AcroForm:
var form = pdfReader.AcroForm;
if (form != null)
writer.CopyAcroForm(reader);

Related

PDF modify takes time in C#

I'm using iTextSharp to modify a pdf file and to add specific data into it.
Here is the scenario I have, I have a DataTable that contains thousands of rows, each row represents a customer or client, and I have one pdf template.
I have to modify the pdf template for each row (client or customer) to add their id to the file in addition to other data and then it will be added to the DataTable for that client.
I'm using this code to do the needed job, but it times out when processing huge amount of rows or it takes at least 15 mins, 4k rows in my case- because this means opening the pdf file 4k times and modifying it as needed.
// file is the pdf tmeplate, id is the customer's id - represens 1 row in the DataTable, landingPage: is client's specific page should be added to the file
private static byte[] GeneratePdfFromPdfFile(byte[] file, int id, string landingPage)
{
try
{
using (var ms = new MemoryStream())
{
//Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
var doc = new iTextSharp.text.Document();
//Create a writer that's bound to our PDF abstraction and our stream
var writer = PdfWriter.GetInstance(doc, ms);
//Open the document for writing
doc.Open();
PdfContentByte cb = writer.DirectContent;
////parse html code to xml
//iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
PdfReader reader = new PdfReader(file);
for (int pageNumber = 1; pageNumber < reader.NumberOfPages + 1; pageNumber++)
{
doc.SetPageSize(reader.GetPageSizeWithRotation(1));
doc.NewPage();
//Insert to Destination on the first page
PdfImportedPage page = writer.GetImportedPage(reader, pageNumber);
int rotation = reader.GetPageRotation(pageNumber);
if (rotation == 90 || rotation == 270)
{
cb.AddTemplate(page, 0, -1f, 1f, 0, 0, reader.GetPageSizeWithRotation(pageNumber).Height);
}
else
{
cb.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
}
}
// Add a new page to the pdf file
doc.NewPage();
// set pdf open action to open the link embedded in the file.
string _embeddedURL = "http://" + landingPage + "/Default.aspx?code=" + id;
PdfAction act = new PdfAction(_embeddedURL);
writer.SetOpenAction(act);
doc.Close();
return ms.ToArray();
}
}
catch { return null; }
}
Note: I'm using ForEach loop to iterate through the DataTable rows
Your question (the text) doesn't correspond with your code (the C# part). You are asking one thing (add content to an existing PDF), and doing another thing (add an option action to an existing PDF), but regardless of what it is that you want to do, you shouldn't use PdfWriter to make a shallow copy of every page of an existing PDF (throwing away all interactivity).
You can significantly reduce the lines of code you wrote by using PdfStamper:
PdfReader reader = new PdfReader(file);
MemoryStream ms = new MemoryStream();
PdfStamper stamper = new PdfStamper(reader, ms);
string _embeddedURL = "http://" + landingPage + "/Default.aspx?code=" + id;
PdfAction act = new PdfAction(_embeddedURL);
stamper.Writer.SetOpenAction(act);
stamper.Close();
reader.Close();
return ms.ToArray();
Please read chapter 6 of my book to see why PdfWriter is the wrong class for you. PdfStamper is meant to manipulate an existing PDF document, keeping all its existing features intact.
Also be aware that you're using an old version of iText. The most recent version is iText 7. Please consult the Jump-Start tutorial for more info on using the latest iText version.

C# iTextSharp Merge multiple pdf via byte array

I am new to using iTextSharp and working with Pdf files in general, but I think I'm on the right track.
I iterate through a list of pdf files, convert them to bytes, and push all of the resulting bytes into a byte array. From there I pass the byte array to concatAndAddContent() to merge all of the pdf's into a single large pdf. Currently I'm just getting the last pdf in the list (they seem to be overwriting)
public static byte[] concatAndAddContent(List<byte[]> pdfByteContent)
{
byte[] allBytes;
using (MemoryStream ms = new MemoryStream())
{
Document doc = new Document();
PdfWriter writer = PdfWriter.GetInstance(doc, ms);
doc.SetPageSize(PageSize.LETTER);
doc.Open();
PdfContentByte cb = writer.DirectContent;
PdfImportedPage page;
PdfReader reader;
foreach (byte[] p in pdfByteContent)
{
reader = new PdfReader(p);
int pages = reader.NumberOfPages;
// loop over document pages
for (int i = 1; i <= pages; i++)
{
doc.SetPageSize(PageSize.LETTER);
doc.NewPage();
page = writer.GetImportedPage(reader, i);
cb.AddTemplate(page, 0, 0);
}
}
doc.Close();
allBytes = ms.GetBuffer();
ms.Flush();
ms.Dispose();
}
return allBytes;
}
Above is the working code that results in a single pdf being created, and the rest of the files are being ignored. Any suggestions
This is pretty much just a C# version of Bruno's code here.
This is pretty much the simplest, safest and recommended way to merge PDF files. The PdfSmartCopy object is able to detect redundancies in the multiple files which can reduce file size some times. One of the overloads on it accepts a full PdfReader object which can be instantiated however you want.
public static byte[] concatAndAddContent(List<byte[]> pdfByteContent) {
using (var ms = new MemoryStream()) {
using (var doc = new Document()) {
using (var copy = new PdfSmartCopy(doc, ms)) {
doc.Open();
//Loop through each byte array
foreach (var p in pdfByteContent) {
//Create a PdfReader bound to that byte array
using (var reader = new PdfReader(p)) {
//Add the entire document instead of page-by-page
copy.AddDocument(reader);
}
}
doc.Close();
}
}
//Return just before disposing
return ms.ToArray();
}
}
List<byte[]> finallist= new List<byte[]>();
finallist.Add(concatAndAddContent(bytes));
System.IO.File.WriteAllBytes("path",finallist);

ITextSharp - using PdfStamper resulting MemoryStream to close

I'm using ITextSharp to split multi-page PDF files into single page files. I also managed to add those single page PDFs to a zip file using MemoryStream.
Now, I need to add password protection to those PDFs using PdfStamper, before adding them into a zip file. But whenever I tried this, an ObjectDisposedException - Cannot access a closed Stream. is being throwed.
Ionic.Zip.ZipFile zipFile = new Ionic.Zip.ZipFile();
int cnt = 0;
try
{
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(new iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdfPath), new ASCIIEncoding().GetBytes(""));
for (cnt = 1; cnt <= reader.NumberOfPages; cnt++)
{
using (MemoryStream memoryStream = new MemoryStream())
{
using (iTextSharp.text.Document document = new iTextSharp.text.Document())
{
iTextSharp.text.pdf.PdfWriter writer = iTextSharp.text.pdf.PdfWriter.GetInstance(document, memoryStream);
using (PdfStamper stamper = new PdfStamper(reader, memoryStream))
{
stamper.SetEncryption(
null,
Encoding.ASCII.GetBytes("password_here"),
PdfWriter.ALLOW_PRINTING,
PdfWriter.ENCRYPTION_AES_128);
}
memoryStreamForZipFile = new MemoryStream(memoryStream.ToArray());
memoryStreamForZipFile.Seek(0, SeekOrigin.Begin);
}
}
}
zipFile.Save(destinationFolder + "/" + fileName.Replace(".pdf", ".zip"));
reader.Close();
reader.Dispose();
}
catch
{
}
finally
{
GC.Collect();
}
return cnt - 1;
I have removed some codes above for clarity.
If I'll remove the PdfStamper "using" block, the code works just fine. I also tried to juggle the position of PdfStamper to see if I used it in the wrong place.
Am I not using using blocks properly? Or I have to fix some code sequence in here?
You removed some lines that are essential are wrong; for instance: I assume that you are adding a PdfImportedPage to the PdfContentByte of a PdfWriter. If that's so, you are ignoring all the warnings given in the official documentation.
You should replace your code by something like this:
PdfReader reader = new PdfReader(pathToFile);
int n = reader.NumberOfPages;
int cnt;
for (cnt = 1; cnt <= reader.NumberOfPages; cnt++)
{
reader = new PdfReader(pathToFile);
reader.SelectPages(cnt.ToString());
MemoryStream memoryStream = new MemoryStream();
using (PdfStamper stamper = new PdfStamper(reader, memoryStream))
{
stamper.SetEncryption(
null,
Encoding.ASCII.GetBytes("password_here"),
PdfWriter.ALLOW_PRINTING,
PdfWriter.ENCRYPTION_AES_128);
}
reader.Close();
// now do something with the memoryStream.ToArray()
}
As you can see, there is no need to introduce a Document or a PdfWriter object. If you use those classes, you throw away all interactivity that exists in the original pages. You also get into trouble if the page size of the original pages is different from A4.
Note that you can't reuse the PdfReader instance when using PdfStamper. Once you pass a PdfReader instance to a PdfStamper, that instance is tampered.

Adding Page Navigation links to PDF document using itextsharp [duplicate]

I have written some code that merges together multiple PDF's into a single PDF that I then display from the MemoryStream. This works great. What I need to do is add a table of contents to the end of the file with links to the start of each of the individual PDF's. I planned on doing this using the GotoLocalPage action which has an option for page numbers but it doesn't seem to work. If I change the action to the code below to one of the presset ones like PDFAction.FIRSTPAGE it works fine. Does this not work because I am using the PDFCopy object for the writer parameter of GotoLocalPage?
Document mergedDoc = new Document();
MemoryStream ms = new MemoryStream();
PdfCopy copy = new PdfCopy(mergedDoc, ms);
mergedDoc.Open();
MemoryStream tocMS = new MemoryStream();
Document tocDoc = null;
PdfWriter tocWriter = null;
for (int i = 0; i < filesToMerge.Length; i++)
{
string filename = filesToMerge[i];
PdfReader reader = new PdfReader(filename);
copy.AddDocument(reader);
// Initialise TOC document based off first file
if (i == 0)
{
tocDoc = new Document(reader.GetPageSizeWithRotation(1));
tocWriter = PdfWriter.GetInstance(tocDoc, tocMS);
tocDoc.Open();
}
// Create link for TOC, added random number of 3 for now
Chunk link = new Chunk(filename);
PdfAction action = PdfAction.GotoLocalPage(3, new PdfDestination(PdfDestination.FIT), copy);
link.SetAction(action);
tocDoc.Add(new Paragraph(link));
}
// Add TOC to end of merged PDF
tocDoc.Close();
PdfReader tocReader = new PdfReader(tocMS.ToArray());
copy.AddDocument(tocReader);
copy.Close();
displayPDF(ms.ToArray());
I guess an alternative would be to link to a named element (instead of page number) but I can't see how to add an 'invisible' element to the start of each file before adding to the merged document?
I would just go with two passes. In your first pass, do the merge as you are but also record the filename and page number it should link to. In your second pass, use a PdfStamper which will give you access to a ColumnText that you can use general abstractions like Paragraph in. Below is a sample that shows this off:
Since I don't have your documents, the below code creates 10 documents with a random number of pages each just for testing purposes. (You obviously don't need to do this part.) It also creates a simple dictionary with a fake file name as the key and the raw bytes from the PDF as a value. You have a true file collection to work with but you should be able to adapt that part.
//Create a bunch of files, nothing special here
//files will be a dictionary of names and the raw PDF bytes
Dictionary<string, byte[]> Files = new Dictionary<string, byte[]>();
var r = new Random();
for (var i = 1; i <= 10; i++) {
using (var ms = new MemoryStream()) {
using (var doc = new Document()) {
using (var writer = PdfWriter.GetInstance(doc, ms)) {
doc.Open();
//Create a random number of pages
for (var j = 1; j <= r.Next(1, 5); j++) {
doc.NewPage();
doc.Add(new Paragraph(String.Format("Hello from document {0} page {1}", i, j)));
}
doc.Close();
}
}
Files.Add("File " + i.ToString(), ms.ToArray());
}
}
This next block merges the PDFs. This is mostly the same as your code except that instead of writing a TOC here I'm just keeping track of what I want to write in the future. Where I'm using file.value you'd use your full file path and where I'm using file.key you'd use your file's name instead.
//Dictionary of file names (for display purposes) and their page numbers
var pages = new Dictionary<string, int>();
//PDFs start at page 1
var lastPageNumber = 1;
//Will hold the final merged PDF bytes
byte[] mergedBytes;
//Most everything else below is standard
using (var ms = new MemoryStream()) {
using (var document = new Document()) {
using (var writer = new PdfCopy(document, ms)) {
document.Open();
foreach (var file in Files) {
//Add the current page at the previous page number
pages.Add(file.Key, lastPageNumber);
using (var reader = new PdfReader(file.Value)) {
writer.AddDocument(reader);
//Increment our current page index
lastPageNumber += reader.NumberOfPages;
}
}
}
}
mergedBytes = ms.ToArray();
}
This last block actually writes the TOC. If we use a PdfStamper we can create a ColumnText which allows us to use Paragraphs
//Will hold the final PDF
byte[] finalBytes;
using (var ms = new MemoryStream()) {
using (var reader = new PdfReader(mergedBytes)) {
using (var stamper = new PdfStamper(reader, ms)) {
//The page number to insert our TOC into
var tocPageNum = reader.NumberOfPages + 1;
//Arbitrarily pick one page to use as the size of the PDF
//Additional logic could be added or this could just be set to something like PageSize.LETTER
var tocPageSize = reader.GetPageSize(1);
//Arbitrary margin for the page
var tocMargin = 20;
//Create our new page
stamper.InsertPage(tocPageNum, tocPageSize);
//Create a ColumnText object so that we can use abstractions like Paragraph
var ct = new ColumnText(stamper.GetOverContent(tocPageNum));
//Set the working area
ct.SetSimpleColumn(tocPageSize.GetLeft(tocMargin), tocPageSize.GetBottom(tocMargin), tocPageSize.GetRight(tocMargin), tocPageSize.GetTop(tocMargin));
//Loop through each page
foreach (var page in pages) {
var link = new Chunk(page.Key);
var action = PdfAction.GotoLocalPage(page.Value, new PdfDestination(PdfDestination.FIT), stamper.Writer);
link.SetAction(action);
ct.AddElement(new Paragraph(link));
}
ct.Go();
}
}
finalBytes = ms.ToArray();
}

Returning memorystream - gives corrupt PDF file or "cannot accessed a closed stream"

I have a web service, which calls the following method. I want to return a memorystream, which is a PDF file.
Now, the problem is the PDF file is corrupt with the following code. I think it's because the files are not being closed. However, if I close them, I get the classic error "Cannot access a closed stream".
When I previously saved it through a filestream, the PDF file wasn't corrupt.
So my humble question is: How to solve it and get back a non-corrupt PDF file? :-)
My code:
public Stream Generate(GiftModel model)
{
var template = HttpContext.Current.Server.MapPath(TemplatePath);
// Magic code which creates a new PDF file from the stream of the other
PdfReader reader = new PdfReader(template);
Rectangle size = reader.GetPageSizeWithRotation(1);
Document document = new Document(size);
MemoryStream fs = new MemoryStream();
PdfWriter writer = PdfWriter.GetInstance(document, fs);
document.Open();
// Two products on every page
int bookNumber = 0;
int pagesWeNeed = (int)Math.Ceiling(((double)model.Books.Count / (double)2));
for (var i = 0; i < pagesWeNeed; i++)
{
PdfContentByte cb = writer.DirectContent;
// Creates a new page
PdfImportedPage page = writer.GetImportedPage(reader, 1);
cb.AddTemplate(page, 0, 0);
// Add text strings
DrawGreetingMessages(model.FromName, model.ReceiverName, model.GiftMessage, cb);
// Draw the books
DrawBooksOnPage(model.Books.Skip(bookNumber).Take(2).ToList(), cb);
// Draw boring shit
DrawFormalities(true, model.GiftLink, cb);
bookNumber += 2;
}
// Close off our streams because we can
//document.Close();
//writer.Close();
reader.Close();
fs.Position = 0;
return fs;
}
Reuse of streams can be problematic, especially if you are using an abstraction and you don't quite know what it is doing to your stream. Because of this I generally recommend never passing streams themselves around. If you can by with it, try just passing the raw underlying byte array itself. But if passing streams is a requirement then I recommend still doing the raw byte array at the end and then wrapping that in a new second stream. Try the below code to see if it works.
public Stream Generate(GiftModel model)
{
//We'll dump our PDF into these when done
Byte[] bytes;
using (var ms = new MemoryStream())
{
using (var doc = new Document())
{
using (var writer = PdfWriter.GetInstance(doc, ms))
{
doc.Open();
doc.Add(new Paragraph("Hello"));
doc.Close();
}
}
//Finalize the contents of the stream into an array
bytes = ms.ToArray();
}
//Return a new stream wrapped around our array
return new MemoryStream(bytes);
}

Categories

Resources