Modify existing pdf (add/remove pages) while preserving metadata - c#

my target is to open an existing pdf, add or remove some pages while preserving the metadata (Author, Subject, ...) in a Windows.Forms C# application.
I use iTextSharp and found examples how to add or remove pages by using the PdfConcatenate class. To keep the metadata I use a PdfStamper afterwards. To speed things up I want to do the modifications in memory before storing the result to disk.
The problem is NOT adding or removing the pages but to keep the metadata in the same step.
So can anybody tell me/giva an example on how to achieve this (better) or am I on the completely wrong track?
Here my current code (see comments for problem related lines):
public void RemovePagesInFile(string documentLocation, int pageIndexFrom, int pageCount)
{
// TB: open the pdf
using (PdfReader sourcePdfReader = new PdfReader(documentLocation))
using (MemoryStream concatenatedTargetStream = new MemoryStream((int)sourcePdfReader.FileLength))
{
// TB: use a concatenator to create a new pdf containing only the desired pages
PdfConcatenate concatenator = new PdfConcatenate(concatenatedTargetStream);
// TB: create a list with the page numbers to keep
List<int> pagesToKeep = new List<int>();
for (int i = 1; i <= pageIndexFrom; i++)
{
pagesToKeep.Add(i);
}
for (int i = pageIndexFrom + pageCount + 1; i <= sourcePdfReader.NumberOfPages; i++)
{
pagesToKeep.Add(i);
}
// TB: execute the page copy
sourcePdfReader.SelectPages(pagesToKeep);
concatenator.AddPages(sourcePdfReader);
// TB: problem(s) here:
// 1. when calling concatenator.Close() the memory stream gets disposed as expected.
// concatenator.Close();
// 2. even when calling concatenator.WriterFlush() the memory stream seems to be missing content (error when creating targetReader (see below)).
// concatenator.Writer.Flush();
// 3. when keeping concatenator open the same error as above occures (I assume not all bytes have been written to the memory stream)
// TB: preserve the meta data from the source document
// => ERROR here: "Rebuild trailer not found. Original Error: PDF startxref not found"
using (PdfReader targetReader = new PdfReader(concatenatedTargetStream))
using (MemoryStream targetStream = new MemoryStream((int)concatenatedTargetStream.Length))
{
using (PdfStamper stamper = new PdfStamper(targetReader, targetStream))
{
stamper.MoreInfo = sourcePdfReader.Info;
// TB: same problem as above with stamper ?
stamper.Close();
}
// TB: close the reader to be able to access the source pdf
sourcePdfReader.Close();
// TB: write the modified pdf to the disk
File.WriteAllBytes(documentLocation, targetStream.ToArray());
}
}
}

Two changes need to be made. Call
concatenator.Writer.CloseStream = false
before calling
concatenator.Close()
Do the same thing for the PdfStamper and you're set.

Related

Open XML WordprocessingDocument with MemoryStream is 0KB

I am trying to learn how to use with Microsoft's Open XML SDK. I followed their steps on how to create a Word document using a FileStream and it worked perfectly. Now I want to create a Word document but only in memory, and wait for the user to specify whether they would like to save the file or not.
This document by Microsoft says how to deal with in-memory documents using MemoryStream, however, the document is first loaded from an existing file and "dumped" into a MemorySteam. What I want is to create a document entirely in memory (not based on a file in a drive). What I thought would achieve that was this code:
// This is almost the same as Microsoft's code except I don't
// dump any files into the MemoryStream
using (var mem = new MemoryStream())
{
using (var doc = WordprocessingDocument.Create(mem, WordprocessingDocumentType.Document, true))
{
doc.AddMainDocumentPart().Document = new Document();
var body = doc.MainDocumentPart.Document.AppendChild(new Body());
var paragraph = body.AppendChild(new Paragraph());
var run = paragraph.AppendChild(new Run());
run.AppendChild(new Text("Hello docx"));
using (var file = new FileStream(destination, FileMode.CreateNew))
{
mem.WriteTo(file);
}
}
}
But the result is a file that is 0KB and that can't be read by Word. At first I thought it was because of the size of the MemoryStream so I provided it with an initial size of 1024 but the results were the same. On the other hand if I change the MemoryStream for a FileStreamit works perfectly.
My question is whether what I want to do is possible, and if so, how? I guess it must be possible, just not how I'm doing it. If it isn't possible what alternative do I have?
There's a couple of things going on here:
First, unlike Microsoft's sample, I was nesting the using block code that writes the file to disk inside the block that creates and modifies the file. The WordprocessingDocument gets saved to the stream until it is disposed or when the Save() method is called. The WordprocessingDocument gets disposed automatically when reaching the end of it's using block. If I had not nested the third using statement, thus reaching the end of the second using statement before trying to save the file, I would have allowed the document to be written to the MemoryStream- instead I was writing a still empty stream to disk (hence the 0KB file).
I suppose calling Save()might have helped, but it is not supported by .Net core (which is what I'm using). You can check whether Save()is supported on you system by checking CanSave.
/// <summary>
/// Gets a value indicating whether saving the package is supported by calling <see cref="Save"/>. Some platforms (such as .NET Core), have limited support for saving.
/// If <c>false</c>, in order to save, the document and/or package needs to be fully closed and disposed and then reopened.
/// </summary>
public static bool CanSave { get; }
So the code ended up being almost identical to Microsoft's code except I don't read any files beforehand, rather I just begin with an empty MemoryStream:
using (var mem = new MemoryStream())
{
using (var doc = WordprocessingDocument.Create(mem, WordprocessingDocumentType.Document, true))
{
doc.AddMainDocumentPart().Document = new Document();
var body = doc.MainDocumentPart.Document.AppendChild(new Body());
var paragraph = body.AppendChild(new Paragraph());
var run = paragraph.AppendChild(new Run());
run.AppendChild(new Text("Hello docx"));
}
using (var file = new FileStream(destination, FileMode.CreateNew))
{
mem.WriteTo(file);
}
}
Also you don't need to reopen the document before saving it, but if you do remember to use Open() instead of Create() because Create() will empty the MemoryStream and you'll also end with a 0KB file.
You're passing mem to WordprocessingDocument.Create(), which is creating the document from the (now-empty) MemoryStream, however, I don't think that is associating the MemoryStream as the backing store of the document. That is, mem is only the input of the document, not the output as well. Therefore, when you call mem.WriteTo(file);, mem is still empty (the debugger would confirm this).
Then again, the linked document does say "you must supply a resizable memory stream to [Open()]", which implies that the stream will be written to, so maybe mem does become the backing store but nothing has been written to it yet because the AutoSave property (for which you specified true in Create()) hasn't had a chance to take effect yet (emphasis mine)...
Gets a flag that indicates whether the parts should be saved when disposed.
I see that WordprocessingDocument has a SaveAs() method, and substituting that for the FileStream in the original code...
using (var mem = new MemoryStream())
using (var doc = WordprocessingDocument.Create(mem, WordprocessingDocumentType.Document, true))
{
doc.AddMainDocumentPart().Document = new Document();
var body = doc.MainDocumentPart.Document.AppendChild(new Body());
var paragraph = body.AppendChild(new Paragraph());
var run = paragraph.AppendChild(new Run());
run.AppendChild(new Text("Hello docx"));
// Explicitly close the OpenXmlPackage returned by SaveAs() so destination doesn't stay locked
doc.SaveAs(destination).Close();
}
...produces the expected file for me. Interestingly, after the call to doc.SaveAs(), and even if I insert a call to doc.Save(), mem.Length and mem.Position are both still 0, which does suggest that mem is only used for initialization.
One other thing I would note is that the sample code is calling Open(), whereas you are calling Create(). The documentation is pretty sparse as far as how those two methods differ, but I would have suggested you try creating your document with Open() instead...
using (MemoryStream mem = new MemoryStream())
using (WordprocessingDocument doc = WordprocessingDocument.Open(mem, true))
{
// ...
}
...however when I do that Open() throws an exception, presumably because mem has no data. So, it seems the names are somewhat self-explanatory in that Create() initializes new document data whereas Open() expects existing data. I did find that if I feed Create() a MemoryStream filled with random garbage...
using (var mem = new MemoryStream())
{
// Fill mem with garbage
byte[] buffer = new byte[1024];
new Random().NextBytes(buffer);
mem.Write(buffer, 0, buffer.Length);
mem.Position = 0;
using (var doc = WordprocessingDocument.Create(mem, WordprocessingDocumentType.Document, true))
{
// ...
}
}
...it still produces the exact same document XML as the first code snippet above, which makes me wonder why Create() even needs an input Stream at all.
I was facing the same problem today, after all, the solution is closing the document to fill the memorystream, here is the example, Lance U. Matthews's example help me alot, and finally I realized, after cheking others document types exports, after fill thems, each one calls method Close, but, Microsoft example doesn't show it
private MemoryStream GenerateWord(DataTable dt)
{
MemoryStream mStream = new MemoryStream();
// Create Document
OpenXMLPackaging.WordprocessingDocument wordDocument =
OpenXMLPackaging.WordprocessingDocument.Create(mStream, OpenXML.WordprocessingDocumentType.Document, true);
// Add a main document part.
OpenXMLPackaging.MainDocumentPart mainPart = wordDocument.AddMainDocumentPart();
mainPart.Document = new OpenXMLWordprocessing.Document();
OpenXMLWordprocessing.Body body = mainPart.Document.AppendChild(new OpenXMLWordprocessing.Body());
OpenXMLWordprocessing.Table table = new OpenXMLWordprocessing.Table();
body.AppendChild(table);
OpenXMLWordprocessing.TableRow tr = new OpenXMLWordprocessing.TableRow();
foreach (DataColumn c in dt.Columns)
{
tr.Append(new OpenXMLWordprocessing.TableCell(new OpenXMLWordprocessing.Paragraph(new OpenXMLWordprocessing.Run(new OpenXMLWordprocessing.Text(c.ColumnName.ToString())))));
}
table.Append(tr);
foreach (DataRow r in dt.Rows)
{
if (dt.Rows.Count > 0)
{
OpenXMLWordprocessing.TableRow dataRow = new OpenXMLWordprocessing.TableRow();
for (int h = 0; h < dt.Columns.Count; h++)
{
dataRow.Append(new OpenXMLWordprocessing.TableCell(new OpenXMLWordprocessing.Paragraph(new OpenXMLWordprocessing.Run(new OpenXMLWordprocessing.Text(r[h].ToString())))));
}
table.Append(dataRow);
}
}
mainPart.Document.Save();
wordDocument.Close();
mStream.Position = 0;
return mStream;
}

Merging N pdf files, created from html using ITextSharp, to another blank pdf file

I need to merge N PDF files into one. I create a blank file first
byte[] pdfBytes = null;
var ms = new MemoryStream();
var doc = new iTextSharp.text.Document();
var cWriter = new PdfCopy(doc, ms);
Later I cycle through html strings array
foreach (NBElement htmlString in someElement.Children())
{
byte[] msTempDoc = getPdfDocFrom(htmlString.GetString(), cssString.GetString());
addPagesToPdf(cWriter, msTempDoc);
}
In getPdfDocFrom I create pdf file using XMLWorkerHelper and return it as byte array
private byte[] getPdfDocFrom(string htmlString, string cssString)
{
var tempMs = new MemoryStream();
byte[] tempMsBytes;
var tempDoc = new iTextSharp.text.Document();
var tempWriter = PdfWriter.GetInstance(tempDoc, tempMs);
tempDoc.Open();
using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(cssString)))
{
using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(htmlString)))
{
//Parse the HTML
iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(tempWriter, tempDoc, msHtml, msCss);
tempMsBytes = tempMs.ToArray();
}
}
tempDoc.Close();
return tempMsBytes;
}
Later on I try to add pages from this PDF file to the blank one.
private static void addPagesToPdf(PdfCopy mainDocWriter, byte[] sourceDocBytes)
{
using (var msOut = new MemoryStream())
{
PdfReader reader = new PdfReader(new MemoryStream(sourceDocBytes));
int n = reader.NumberOfPages;
PdfImportedPage page;
for (int i = 1; i <= n; i++)
{
page = mainDocWriter.GetImportedPage(reader, i);
mainDocWriter.AddPage(page);
}
}}
It breaks when it tries to create a PdfReader from the byte array I pass to the function. "Rebuild failed: trailer not found.; Original message: PDF startxref not found."
I used another library to work with PDF before. I passed 2 PdfDocuments as an objects and just added pages from one to another in cycle. It didn't support Css though, so I had to switch to ITextSharp.
I don't quite get the difference between PdfWriter and PdfCopy.
There a logical error in your code. When you create a document from scratch as is done in the getPdfDocFrom() method, the document isn't complete until you've triggered the Close() method. In this Close() method, a trailer is created as well as a cross-reference (xref) table. The error tells you that those are missing.
Indeed, you do call the Close() method:
tempDoc.Close();
But by the time you Close() the document, it's too late: you have already created the tempMsBytes array. You need to create that array after you close the document.
Edit: I don't know anything about C#, but if MemoryStream clears its buffer after closing it, you could use mainDocWriter.CloseStream = false; so that the MemoryStream isn't closed when you close the document.
In Java, it would be a bad idea to set the "close stream" parameter to false. When I read the answers to the question Create PDF in memory instead of physical file I see that C# probably doesn't always require this extra line.
Remark: merging files by adding PdfImportedPage instances to a PdfWriter is an example of bad taste. If you are using iTextSharp 5 or earlier, you should use PdfCopy or PdfSmartCopy to do that. If you use PdfWriter, you throw away a lot of information (e.g. link annotations).

PdfStamper being disposed

The PdfStamper I'm passing in to this method is being disposed of at the end of the method - why, and how do I stop it? I'm trying to create a page object from the template, which I can then add to the PdfStamper X number of times.
//real code
public void DoSpecialAction(PdfStamper pdfStamper)
{
using (var pdfTemplate = new PdfReader(_extraPageTemplatePath))
using (var pdfReader = new PdfReader(pdfTemplate))
{
PdfImportedPage page = pdfStamper.GetImportedPage(pdfReader, 1);
pdfStamper.InsertPage(3, pdfReader.GetPageSize(1));
PdfContentByte pb = pdfStamper.GetUnderContent(3);
pb.AddTemplate(page, 0, 0);
}
}
the program structure is as follows:
//psuedocode
class PrintFieldsToPdf {
foreach (normalfield) {
PrintNormalFields();
}
foreach (specialaction) {
DoSpecialAction(pdfStamper);
}
pdfStamper.Close(); //at this point the object has been deallocated
}
Throwing the following exception:
An exception of type 'System.ObjectDisposedException' occurred in mscorlib.dll but was not handled in user code
Additional information: Cannot access a closed file.
The OP eventually commented:
I have a hunch it may be that the page object never actually gets copied until the PdfStamper calls Close and writes the file, and therefore the PdfReader I'm using to read the extra page template is causing the issue, as it is disposed of at the end of my method, before PdfStamper is closed.
His hunch was correct: The copying of at least certain parts of the original page is delayed until the PdfStamper is being closed. This allows for certain optimizations in case multiple pages from the same PdfReader instance are imported in separate calls.
The use case of imports from many different PdfReaders had also been on the mind of the iText(Sharp) developers. So they provided a way to tell the PdfStamper to copy everything required from a given PdfReader at the time the user is sure he won't copy anything else from it:
public void DoSpecialAction(PdfStamper pdfStamper)
{
using (var pdfTemplate = new PdfReader(_extraPageTemplatePath))
using (var pdfReader = new PdfReader(pdfTemplate))
{
PdfImportedPage page = pdfStamper.GetImportedPage(pdfReader, 1);
pdfStamper.InsertPage(3, pdfReader.GetPageSize(1));
PdfContentByte pb = pdfStamper.GetUnderContent(3);
pb.AddTemplate(page, 0, 0);
// Copy everything required from the PdfReader
pdfStamper.Writer.FreeReader(pdfReader);
}
}

ITextSharp returning same size pdf when I'm trying to compress pdf file with different levels

I'm reading a pdf and injecting some content using itextsharp. The resulting byte[] is passed to the method below along with the compression level.
public static byte[] method(byte[] pdf,int compressionlevel)
{
using (MemoryStream outputPdfStream1 = new MemoryStream())
{
//PdfReader reader1 = new PdfReader(pdf);
//PdfStamper stamper1 = new PdfStamper(reader1, outputPdfStream1);
//int level = (int)compressionlevel;
//if (level <= 9)
// stamper1.Writer.CompressionLevel = (int)compressionlevel;
//else
// stamper1.Writer.SetFullCompression();
//stamper1.SetFullCompression();
//stamper1.Close();
//byte[] newfile = outputPdfStream1.ToArray();
//return newfile;
PdfReader reader = new PdfReader(pdf);
PdfStamper stamper = new PdfStamper(reader, outputPdfStream1,PdfWriter.VERSION_1_5);
int level = (int)compressionlevel;
if (level <= 9)
{
stamper.Writer.CompressionLevel = level;
}
else
stamper.Writer.SetFullCompression();
int total = reader.NumberOfPages + 1;
for (int i = 1; i < total; i++)
{
reader.SetPageContent(i, reader.GetPageContent(i));
}
stamper.SetFullCompression();
stamper.Close();
byte[] newfile = outputPdfStream1.ToArray();
return newfile;
}
}
If I comment stamper.SetFullCompression(); then this method is returning same size of byte array irrespective of the compression level am passing(am passing from 0 to 9 in each case)..
If I uncomment stamper.SetFullCompression(); this method is returning fully compressed version of the input byte irrespective of the compression level!!!
What is the purpose/difference of stamper.SetFullCompression(); and stamper.Writer.SetFullCompression();?
What is the correct way to use different compression levels so that I can see different sizes in each case?
You have a couple of questions and I'll try my best to answer them.
PdfStamper is a helper class that ultimately uses another class called PdfStamperImp to do most of the work. PdfStamperImp is derived from PdfWriter and when you use stamper.Writer you are actually getting back this implementation class. Many of the properties on PdfStamper also pass directly through to the implementation class. So these two calls actually do the same thing.
stamper.SetFullCompression();
stamper.Writer.SetFullCompression();
Another point of confusion is that SetFullCompression and the CompressionLevel aren't actually related at all. "Full compression" represents a feature that was added in PDF 1.5 called "Object Streams" that allows grouping PDF objects together to potentially allow for greater compression. There's actually no requirement that what we think of as "compression" actually occurs but in reality I think it would always happen. (Possibly a super simple document might get larger with this enabled, not sure and don't feel like testing.)
The CompressionLevel is actually what you normally think of as compression, a number from 0 to 9 or -1 to mean default (which currently equals six I think). This property is actually part of the PdfStream class which many classes ultimately derive from. This setting doesn't "trickle down", however. Since you are importing a stream from another location via GetPageContent() and SetPageContent() that specific stream has its own compression settings unrelated to the Writer's compression settings. There's actually a third parameter that you can pass to SetPageContent() to set your specific compression level if you want.
reader.SetPageContent(1, reader.GetPageContent(1), PdfStream.BEST_COMPRESSION);

Use iTextSharp to save a PDF to a SQL Server 2008 Blob, and read that Blob to save to disk

I'm currently trying to use iTextSharp to do some PDF field mapping, but the challenging part right now is just saving the modified file in a varbinary[max] column. Then I later need to read that blob and convert it into a pdf which I save to a file.
I've been all over looking at example code but I can't find exactly what I'm looking for, and can't seem to piece together the [read from file to iTextSharp object] -> [do my stuff] -> [convert to varbinary(max)] pipeline, nor the conversion of that blob back into a savable file.
If anyone has code snippet examples that would be extremely helpful. Thanks!
The need to deal with a pdf in multiple passes was not immediately clear when I first started working them, so maybe this is some help to you.
In the method below, we create a pdf, render it to a byte[], load it for post processing, render the pdf again and return the result.
The rest of your question deals with getting a byte[] into and out of a varbinary[max], saving a byte[] to file and reading it back out, which you can google easily enough.
public byte[] PdfGeneratorAndPostProcessor()
{
byte[] newPdf;
using (var pdf = new MemoryStream())
using (var doc = new Document(iTextSharp.text.PageSize.A4))
using (PdfWriter.GetInstance(doc, pdf))
{
doc.Open();
// do stuff to the newly created doc...
doc.Close();
newPdf = pdf.GetBuffer();
}
byte[] postProcessedPdf;
var reader = new PdfReader(newPdf);
using (var pdf = new MemoryStream())
using (var stamper = new PdfStamper(reader, pdf))
{
var pageCount = reader.NumberOfPages;
for (var i = 1; i <= pageCount; i++)
{
// do something on each page of the existing pdf
}
stamper.Close();
postProcessedPdf = pdf.GetBuffer();
}
reader.Close();
return postProcessedPdf;
}

Categories

Resources