At work I sometimes have to merge anywhere from a few to a few hundred PDF files. Until now I have been using the PdfWriter and PdfImportedPage classes. But once all the files are merged into one, the result is enormous, roughly the sum of the sizes of all merged files, because fonts are attached to every page and not reused (fonts are embedded per page, not once for the whole document).
Not long ago I found out about the PdfSmartCopy class, which reuses embedded fonts and images. And here the problem kicks in. Very often, before merging files together, I have to add extra content to them (images, text). For this I usually use the PdfContentByte from the writer object:
Document doc = new Document();
PdfWriter writer = PdfWriter.GetInstance(doc, new FileStream(@"C:\test.pdf", FileMode.Create));
PdfContentByte cb = writer.DirectContent;
cb.Rectangle(100, 100, 100, 100);
cb.SetColorStroke(BaseColor.RED);
cb.SetColorFill(BaseColor.RED);
cb.FillStroke();
When I do a similar thing with a PdfSmartCopy object, the pages are merged, but no additional content is added. The full code of my test with PdfSmartCopy:
using (Document doc = new Document())
{
    using (PdfSmartCopy copy = new PdfSmartCopy(doc, new FileStream(Path.GetDirectoryName(pdfPath[0]) + "\\testas.pdf", FileMode.Create)))
    {
        doc.Open();
        PdfContentByte cb = copy.DirectContent;
        for (int i = 0; i < pdfPath.Length; i++)
        {
            PdfReader reader = new PdfReader(pdfPath[i]);
            for (int ii = 0; ii < reader.NumberOfPages; ii++)
            {
                PdfImportedPage import = copy.GetImportedPage(reader, ii + 1);
                copy.AddPage(import);
                cb.Rectangle(100, 100, 100, 100);
                cb.SetColorStroke(BaseColor.RED);
                cb.SetColorFill(BaseColor.RED);
                cb.FillStroke();
                doc.NewPage(); // not a necessary line
                //ColumnText col = new ColumnText(cb);
                //col.SetSimpleColumn(100,100,500,500);
                //col.AddText(new Chunk("wdasdasd", PdfFontManager.GetFont(@"C:\Windows\Fonts\arial.ttf", 20)));
                //col.Go();
            }
        }
    }
}
Now I have a few questions:
1. Is it possible to edit a PdfSmartCopy object's DirectContent?
2. If not, is there another way to merge multiple PDF files into one without dramatically increasing its size, while still being able to add content to pages during the merge?
First this: using PdfWriter/PdfImportedPage is not a good idea. You throw away all interactive features! Being the author of iText, it's very frustrating to see so many people making the same mistake in spite of the fact that I wrote two books about this, and in spite of the fact that I convinced my publisher to offer one of the most important chapters for free: http://www.manning.com/lowagie2/samplechapter6.pdf
Is my writing really that bad? Or is there another reason why people keep on merging documents using PdfWriter/PdfImportedPage?
As for your specific questions, here are the answers:
1. Yes. Download the sample chapter and search the PDF file for PageStamp.
2. Only if you create the PDF in two passes. For instance: create the huge PDF first, then reduce the size by passing it through PdfCopy; or create the merged PDF first with PdfCopy, then add the extra content in a second pass using PdfStamper.
Code after applying Bruno Lowagie's answer:
for (int i = 0; i < pdfPath.Length; i++)
{
    PdfReader reader = new PdfReader(pdfPath[i]);
    PdfImportedPage page;
    PdfSmartCopy.PageStamp stamp;
    for (int ii = 0; ii < reader.NumberOfPages; ii++)
    {
        page = copy.GetImportedPage(reader, ii + 1);
        stamp = copy.CreatePageStamp(page);
        PdfContentByte cb = stamp.GetOverContent();
        cb.Rectangle(100, 100, 100, 100);
        cb.SetColorStroke(BaseColor.RED);
        cb.SetColorFill(BaseColor.RED);
        cb.FillStroke();
        stamp.AlterContents(); // don't forget to add this line
        copy.AddPage(page);
    }
}
2. Only if you create the PDF in two passes. For instance: create the huge PDF first, then reduce the size by passing it through PdfCopy; or create the merged PDF first with PdfCopy, then add the extra content in a second pass using PdfStamper.
It is much more difficult to use PdfStamper in a second pass. When you're working with lots of data, it's far easier to create one PDF, stamp it, and then append it.
PdfCopyFields used to work well for this, but it no longer works as of the 5.4.4.0 release, which is why I'm here.
Related
I need to remove the first few pages of a PDF file. Apparently, the easiest way to do that is to create a copy of it and not duplicate the unwanted pages. This works, but they look a lot smaller than they should. Any ideas?
How it should look
How it actually looks
private static void ClipSpecificPDF(string input, string output, int pagesToCut)
{
    PdfReader myReader = new PdfReader(input);
    using (FileStream fs = new FileStream(output, FileMode.Create, FileAccess.Write, FileShare.None))
    {
        using (Document doc = new Document())
        {
            using (PdfWriter myWriter = PdfWriter.GetInstance(doc, fs))
            {
                //Open the destination for writing
                doc.Open();
                //Loop through each page that we want to keep
                for (int i = pagesToCut; i < myReader.NumberOfPages; i++)
                {
                    //Add a new blank page to destination document
                    var PS = myReader.GetPageSizeWithRotation(i);
                    myWriter.SetPageSize(PS);
                    doc.NewPage();
                    //Extract the given page from our reader and add it directly to the destination PDF
                    myWriter.DirectContent.AddTemplate(myWriter.GetImportedPage(myReader, i + 1), 0, 0);
                }
                //Close our document
                doc.Close();
            }
        }
    }
}
The problem you describe is explained in the FAQ, for instance in the answers to these questions:
How to merge documents correctly?
Why does the function to concatenate / merge PDFs cause issues in some cases?
Using PdfWriter to manipulate PDF documents is a very bad idea. Read chapter 6 of my book to discover why this is a bad idea, and take a look at Table 6.1 to find out which class is a better fit.
In the same chapter, you'll find the SelectPages example. Suppose that you want to create a new PDF containing only pages 4 to 8. In that case, you simply use the SelectPages() method and PdfStamper:
PdfReader reader = new PdfReader(src);
reader.SelectPages("4-8");
PdfStamper stamper = new PdfStamper(reader, new FileStream(dest, FileMode.Create, FileAccess.Write));
stamper.Close();
reader.Close();
By using PdfReader, the page size is preserved, as well as any of the interactive features that may be present.
Your approach is bad because you do not respect the original page size: you copy a document with letter (?) format to a document with A4 pages. If the origin of the page doesn't correspond with the lower-left corner, parts of your document will be invisible. If there are interactive features in your PDF, they will be lost. Of all the possible examples you could have followed, you picked the worst one...
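To make the point about the page origin concrete: a page's visible area is a box given by its lower-left and upper-right corners, and the lower-left corner is not guaranteed to be (0, 0). The following is a minimal Java sketch of that arithmetic only. It does not use iText, and `PageBox` and its members are hypothetical names chosen for illustration. It shows why a fixed (0, 0) offset in AddTemplate() can push content out of view. It is not a substitute for the PdfStamper approach recommended above.

```java
// Hypothetical illustration (not iText API): a page box in PDF user-space points.
final class PageBox {
    final float llx, lly, urx, ury; // lower-left and upper-right corners

    PageBox(float llx, float lly, float urx, float ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    float width()  { return urx - llx; }
    float height() { return ury - lly; }

    // Translation that would move this box's lower-left corner onto (0, 0).
    float[] offsetToOrigin() {
        return new float[] { -llx, -lly };
    }

    public static void main(String[] args) {
        // A letter-sized box whose origin is shifted by (36, 50).
        PageBox box = new PageBox(36f, 50f, 648f, 842f);
        float[] t = box.offsetToOrigin();
        System.out.println(box.width() + " x " + box.height()
                + ", shift by " + t[0] + ", " + t[1]);
    }
}
```

With an offset of (0, 0), the content of such a page would be drawn 36 points to the right and 50 points up relative to the new page, which is the kind of displacement or clipping described above.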
I have a small problem using iTextSharp and C#.
Context:
I download PDFs and merge them into one huge PDF.
Problem:
On every page the first couple of centimeters are just white, and the PDF I import starts after that white chunk.
The end of every page is correct. There are no overlapping or missing objects or text, which you would expect given that the content has less space. I think it might get compressed vertically.
So the import works fine, but it always adds a few centimeters of white at the top of every page.
It feels like a top margin, but I can't seem to fix it.
Any ideas?
I appreciate your help. Thanks a lot.
public void method()
{
    // needed variables for the pdf-merging part
    fs = new FileStream(Variables.destinationFile, FileMode.Create);
    writer = PdfWriter.GetInstance(doc, fs);
    doc.Open();
    doc.SetPageSize(PageSize.A4);
    doc.SetMargins(0f, 0f, 0f, 0f);
    pdfContent = writer.DirectContent;
    byte[] result;
    int numPages;
    foreach (Tuple<string, string, int> currentTuple in someArray)
    {
        try
        {
            result = client.DownloadData(new Uri(adress + currentTuple.Item1 + ".pdf"));
            // read and add the pages to the output file
            reader = new PdfReader(result);
            numPages = reader.NumberOfPages;
            for (int i = 1; i < numPages + 1; i++)
            {
                doc.NewPage();
                page = writer.GetImportedPage(reader, i);
                pdfContent.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
            }
        }
        catch (Exception e)
        {
        }
    }
    doc.Close();
    writer.Close();
    fs.Close();
}
p.s. why does it always delete my "hi there"? :)
You are using the wrong method to merge documents. Your method throws away all interactivity and does not respect page sizes (which explains the problem you are reporting). Please tell me where you got the inspiration for merging documents this way, so that I can go and spank the person responsible for the example you were using ;-)
The correct way of concatenating documents is explained in chapter 6 of my book.
You can find some more examples here:
ITextSharp PdfCopy use examples
copy pdf form with PdfCopy not working in itextsharp 5.4.5.0
Merge PDFs iTextSharp
itextsharp PdfCopy and landscape pages
...
As you can see, your question has been answered many times before on StackOverflow, in the sense that many people have been using the correct way to merge documents (using PdfCopy) instead of doing it the wrong way (using PdfWriter and AddTemplate()).
In your comment, you say that the method AddPage() doesn't exist in PdfCopy. Let's take a look at the most recent version of that class: PdfCopy.cs
I clearly see:
/**
 * Add an imported page to our output
 * @param iPage an imported page
 * @throws IOException, BadPdfFormatException
 */
public virtual void AddPage(PdfImportedPage iPage) {
Note that recent versions also have an AddDocument() method:
virtual public void AddDocument(PdfReader reader) {
Using this method, you no longer have to loop over all the pages, but you can add all the pages of the PDF being read by PdfReader at once.
If you only want to add a selection of pages, you can use:
virtual public void AddDocument(PdfReader reader, List<int> pagesToKeep) {
Please do not use unofficial versions! The official version can be downloaded here: http://sourceforge.net/projects/itextsharp/files/itextsharp/
iText Group does not take any responsibility regarding old versions of iTextSharp, nor can we be held responsible for forks of our software.
How can I remove page breaks from a pdf, so the output would be a single 'page' PDF? So if a normal page is 400x900 and I have 4 pages, a resulting file would be 1600x900. I previously did this for Tif files (Remove page breaks in multi-page tif to make one long page), but would like to do it with PDF. Could I possibly convert to ps, remove whatever code means 'page break', then convert back to pdf?
This can be done with the iTextSharp library by using a single-column PdfPTable and sizing the document according to the number of pages.
You'll of course need a few references to the iTextSharp DLL found here
using iTextSharp.text;
using iTextSharp.text.pdf;
using System.IO;
Here's a simple example:
public static void MergePages()
{
    // Original PDF containing page breaks.
    using (PdfReader reader = new PdfReader(@"C:\Users\cmilne\Desktop\AA0081913.pdf"))
    {
        int pages = reader.NumberOfPages;
        float postProcessPageHeight = 0;
        float postProcessPageWidth = 0;
        // The output page is as tall as all pages stacked and as wide as the widest page.
        for (int p = 1; p <= pages; p++)
        {
            var size = reader.GetPageSize(p);
            postProcessPageHeight += size.Height;
            if (size.Width > postProcessPageWidth)
                postProcessPageWidth = size.Width;
        }
        var rect = new Rectangle(postProcessPageWidth, postProcessPageHeight);
        using (Document document = new Document(rect, 0, 0, 0, 0))
        {
            // Location and name of the new PDF without page breaks.
            PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(@"C:\Users\cmilne\Desktop\AA0081913_NEW.pdf", FileMode.Create));
            document.Open();
            PdfImportedPage page;
            PdfPTable table = new PdfPTable(1);
            table.WidthPercentage = 100;
            // Each original page becomes one image cell in the table.
            for (int i = 1; i <= pages; i++)
            {
                page = writer.GetImportedPage(reader, i);
                table.AddCell(iTextSharp.text.Image.GetInstance(page));
            }
            document.Add(table);
            document.Close();
        }
    }
}
The final page size must be smaller than 14400 by 14400 (this is all that iTextSharp allows). For an 8 1/2 x 11 PDF at a common resolution, that makes the maximum about 18 pages.
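The size calculation and the limit can be checked without any PDF library. The sketch below is plain Java with hypothetical names (`StripSize`, `combined`, `maxPages`). It assumes sizes are given in points as {width, height} pairs and that a US Letter page is 612 x 792 points, which is 8.5 x 11 inches at 72 points per inch.

```java
import java.util.List;

// Hypothetical helper (not iText API): page sizes are {width, height} in points.
final class StripSize {
    static final float MAX_DIMENSION = 14400f; // limit quoted in the answer above

    // Stack pages vertically: total height is the sum, width is the widest page.
    static float[] combined(List<float[]> pageSizes) {
        float width = 0f;
        float height = 0f;
        for (float[] size : pageSizes) {
            width = Math.max(width, size[0]);
            height += size[1];
        }
        return new float[] { width, height };
    }

    // Number of pages of the given height that fit under the limit.
    static int maxPages(float pageHeight) {
        return (int) Math.floor(MAX_DIMENSION / pageHeight);
    }

    public static void main(String[] args) {
        // US Letter is 792 points tall; 14400 / 792 is a little over 18.
        System.out.println(maxPages(792f));
    }
}
```

Checking the combined height against the limit before creating the Document avoids discovering the problem only after the output has been written.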
Use the iTextSharp C# library. It gives you a lot of options to manipulate PDFs. I've used it before when I had to write an import application for a closed-source document repository. It worked like a charm. The only downside is that their documentation is kind of spotty because they want you to purchase their book. You can browse their Java API for free, though, since it's almost identical to the C# one, and just play around with it to find the C# equivalent.
iText: http://itextpdf.com/
I'm using iText to generate a PDF document that consists of several copies of almost the same information.
E.g.: An invoice. One copy is given to the customer, another is filed and a third one is given to an accountant for book-keeping.
All the copies must be exactly the same except for a little piece of text that indicates who the copy is for (Customer, Accounting, File, ...).
There are two possible scenarios (I don't know if the solution is the same for both of them):
a) Each copy goes in a different page.
b) All the copies go on the same page (the paper will have perforations to separate the copies).
There will be a wrapper or helper class which uses iText to generate the PDF in order to be able to do something like var pdf = HelperClass.CreateDocument(DocumentInfo info);. The multiple-copies problem will be solved inside this wrapper/helper.
What does iText provide to accomplish this? Do I need to write each element in the document several times in different positions/pages? Or does iText provide some way to write one copy to the document and then copy it to another position/page?
Note: It's a .NET project, but I tagged the question with both java and c# because this question is about how to use iText properly, so the answer will help developers in both languages.
If each copy goes on a different page, you can create a new document and copy in the page multiple times. Using iText in Java you can do it like this:
// Create output PDF
Document document = new Document(PageSize.A4);
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
PdfContentByte cb = writer.getDirectContent();
// Load existing PDF
PdfReader reader = new PdfReader(templateInputStream);
PdfImportedPage page = writer.getImportedPage(reader, 1);
// Copy first page of existing PDF into output PDF
document.newPage();
cb.addTemplate(page, 0, 0);
// Add your first piece of text here
document.add(new Paragraph("Customer"));
// Copy the same page of the existing PDF onto a second output page
document.newPage();
cb.addTemplate(page, 0, 0);
// Add your second piece of text here
document.add(new Paragraph("Accounting"));
// etc...
document.close();
If you want to put all the copies on the same page, the code is similar but instead of using zeroes in addTemplate(page, 0, 0) you'll need to set values for the correct position; the numbers to use depend on the size and shape of your invoice.
See also iText - add content to existing PDF file — the above code is based on the code I wrote in that answer.
Here's how I see this working.
PdfReader reader = new PdfReader( templatePDFPath );
Document doc = new Document();
PdfWriter writer = PdfWriter.getInstance( doc, new FileOutputStream( "blah.pdf" ) );
doc.open(); // the document must be open before content can be written
PdfImportedPage inputPage = writer.getImportedPage( reader, 1 );
PdfContentByte curPageContent = writer.getDirectContent();
String extraStuff[] = getExtraStuff();
for (String stuff : extraStuff) {
    curPageContent.saveState();
    curPageContent.addTemplate( inputPage, 0, 0 ); // or x, y to reposition the copy
    curPageContent.restoreState();
    curPageContent.beginText();
    curPageContent.setTextMatrix( x, y );
    curPageContent.setFontAndSize( someFont, someSize );
    // the actual work:
    curPageContent.showText( stuff );
    curPageContent.endText();
    // save the contents of curPageContent out to the file and reset it for the next page.
    doc.newPage();
}
doc.close();
That's the bare minimum of work on the computer's part. Quite Efficient, and it'll result in a smaller PDF. Rather than having N copies of that page, with tweaks, you have one copy of that page that's reused on N pages, with little tweaks on top.
You could do the same thing, and use the "x,y" parameters in addTemplate to draw them all on the same page. Up to you.
PS: you'll need to figure out the coordinates for setTextMatrix in advance.
You could also use PdfCopy or PdfSmartCopy to do this.
PdfReader reader = new PdfReader(@"Path\To\File");
Document doc = new Document();
// ms1: a seekable output stream (e.g. a MemoryStream) created earlier
PdfCopy copier = new PdfCopy(doc, ms1);
//PdfSmartCopy copier = new PdfSmartCopy(doc, ms1);
doc.Open();
copier.CloseStream = false;
PdfImportedPage inputPage = copier.GetImportedPage(reader, 1);
for (int i = 0; i < count; i++)
{
    copier.AddPage(inputPage);
}
doc.Close();
ms1.Flush();
ms1.Position = 0;
The difference between PdfCopy and PdfSmartCopy is that PdfCopy copies the entire PDF for each page, while PdfSmartCopy outputs a PDF that internally contains only one copy that all pages reference. This results in a smaller file and less bandwidth on a network; however, it uses more memory on the server and takes longer to process.
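The trade-off described here (smaller output in exchange for more memory and processing) is typical of content deduplication in general. The following plain-Java sketch illustrates the idea only. It is not how iText's classes are implemented, and `DedupStore` and its methods are hypothetical names. Each piece of content is hashed, and identical content is stored once and referenced by id afterwards.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only; not iText code.
final class DedupStore {
    private final Map<String, Integer> idByDigest = new HashMap<>();
    private final List<byte[]> stored = new ArrayList<>();

    // Returns the id of the stored copy, adding the bytes only if they are new.
    int add(byte[] content) {
        String key = digest(content);
        Integer existing = idByDigest.get(key);
        if (existing != null) {
            return existing; // identical content: reuse the earlier copy
        }
        stored.add(content.clone());
        int id = stored.size() - 1;
        idByDigest.put(key, id);
        return id;
    }

    int storedCount() {
        return stored.size();
    }

    private static String digest(byte[] content) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            return Base64.getEncoder().encodeToString(sha.digest(content));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DedupStore store = new DedupStore();
        byte[] font = "same embedded font".getBytes(StandardCharsets.UTF_8);
        int first = store.add(font);
        int second = store.add(font.clone());
        System.out.println(first == second);
        System.out.println(store.storedCount());
    }
}
```

The digest map is the extra memory, and hashing every piece of content is the extra processing time. Both grow with the amount of content added.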
Question 298829 describes how linearizing your PDFs lets them stream page-by-page into the user's browser, so the user doesn't have to wait for the whole document to download before starting to view it. We have been using such PDFs successfully, but now have a new wrinkle: We want to keep the page-by-page streaming, but we also want to insert a fresh cover page at the front of the PDF documents each time we serve them up. (The cover-page will have time-sensitive information, such as the date, so it's not practical to include the cover page in the PDFs on disk.)
To help with this, are there any PDF libraries that can quickly append a cover page to a pre-linearized PDF and yield a streamable, linearized PDF as output? What's of the greatest concern is not the total time to merge the PDFs, but how soon we can start streaming part of the merged document to the user.
We were trying to do this with itextsharp, but it turns out that library can't output linearized PDFs. (See http://itext.ugent.be/library/question.php?id=21) Nonetheless, the following ASP.NET/itextsharp scratch code demonstrates the sort of API we're thinking of. In particular, if itextsharp always output linearized PDFs, something like this might already be the solution:
public class StreamPdf : IHttpHandler
{
    // Required by IHttpHandler.
    public bool IsReusable { get { return false; } }

    public void ProcessRequest(HttpContext context)
    {
        context.Response.ContentType = "application/pdf";
        RandomAccessFileOrArray ramFile = new RandomAccessFileOrArray(@"C:\bigpdf.pdf");
        PdfReader reader1 = new PdfReader(ramFile, null);
        Document doc = new Document();
        // We'll stream the PDF to the ASP.NET output
        // stream, i.e. to the browser
        // (assumption: s is the response output stream):
        Stream s = context.Response.OutputStream;
        PdfWriter writer = PdfWriter.GetInstance(doc, s);
        writer.Open();
        doc.Open();
        PdfContentByte cb = writer.DirectContent;
        // output cover page:
        BaseFont bf = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
        Font font = new Font(bf, 11, Font.NORMAL);
        ColumnText ct = new ColumnText(cb);
        ct.SetSimpleColumn(60, 300, 600, 300 + 28 * 15, 15, Element.ALIGN_CENTER);
        ct.AddText(new Phrase(15, "This is a cover page information\n", font));
        ct.AddText(new Phrase(15, "Date: " + DateTime.Now.ToShortDateString() + "\n", font));
        ct.Go();
        // output src document:
        int i = 0;
        while (i < reader1.NumberOfPages)
        {
            i++;
            // add next page from source PDF:
            doc.NewPage();
            PdfImportedPage page = writer.GetImportedPage(reader1, i);
            cb.AddTemplate(page, 0, 0);
            // use something like this to flush the current page to the
            // browser:
            writer.Flush();
            s.Flush();
            context.Response.Flush();
        }
        doc.Close();
        writer.Close();
        s.Close();
    }
}
Ideally we're looking for a .NET library, but it would be worth hearing about any other options as well.
You could try GhostScript. I think it's possible to stitch PDFs together with it, but I don't know whether it can linearize them. I have a C# GhostScript wrapper that can be used with the GhostScript DLL directly, and I am sure it can be modified to merge PDFs. Contact details at: redmanscave.blogspot.com