I'm cycling through a bunch of PDFs and merging them into a single PdfDocument. I load one PDF using XPdfForm.FromStream(stm), then add a new page using AddPage, and draw the XPdfForm to that page. This seems to be the typical way to do this.
Some of these incoming PDFs contain duplicate images. I'd like to eliminate these as they create a file much larger than required.
Given an input XPdfForm, and the desire to draw it onto a PdfPage using an XGraphics... how can I design code that would not add duplicate images, but instead refer to a reusable image?
In an ideal world, PDFsharp would remove all duplicate objects (images, fonts) while saving.
It's on our wishlist.
It seems there was already an implementation for this problem.
http://forum.pdfsharp.net/viewtopic.php?f=4&t=648
I don't know why it was removed, but the old source is still available at SourceForge.
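To illustrate the general idea (this is not PDFsharp code, just a minimal Python sketch): duplicates can be detected by hashing each image's raw stream bytes and keeping only the first copy of each distinct payload. The function name dedupe_streams and the plain list-of-bytes input are assumptions made for the example; a real implementation would also have to rewrite the object references inside the PDF.

```python
import hashlib


def dedupe_streams(streams):
    """Return (unique, refs): the distinct payloads in first-seen order,
    and for every input stream the index of its payload in `unique`."""
    unique = []           # distinct payloads
    index_by_digest = {}  # SHA-256 digest -> position in `unique`
    refs = []             # one entry per input stream
    for data in streams:
        digest = hashlib.sha256(data).digest()
        if digest not in index_by_digest:
            index_by_digest[digest] = len(unique)
            unique.append(data)
        refs.append(index_by_digest[digest])
    return unique, refs
```

Every page that used a duplicate would then point at the shared object given by refs instead of carrying its own copy.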
Related
I am trying to add a header to an existing PDF file with the help of PdfPageEventHelper. For some files I get the error 'document has no pages', while the code works perfectly for other files. While debugging I found that for the failing files the OnEndPage method is never called, which may be what causes the 'document has no pages' exception.
Any idea why this method or event (OnEndPage) is not called?
Maybe your PDF files do not contain any page information. PDF renders content into bounding boxes. All you do is to define a box and render stuff into it. Therefore you do not need any page information.
Out there, in the real world, there are a lot of crazy pages. Some declare a box so that (0,0) is in the middle of the box. That is perfect for drawing functions, but some libraries fail because they only think in pages that start with (0,0) at the left/top corner. And such boxes can be transformed multiple times inside a document.
Many PDF documents contain a lot of parts which break the PDF rules. There are some tools on the market which will validate your files against the PDF reference. A few try to fix them. A cheap workaround may be to read the PDF into LibreOffice and save it again as PDF. This will fix only a small set of errors, but yours may be among them.
You have to read the failing documents in a text editor to find the reason. But it is a pain in the a*
Currently, I can extract all the text chunks with their location data from a PDF. The problem is that the PDF contains images with text annotations which I do not want included in the extraction.
However, for whatever reason whenever I search the PDF for images, it only finds 1 of the images and usually throws the exception: The colour space is not supported. It's as if it doesn't recognise them as images?
I am not wishing to extract the images, just locate where they start and end in relation to the PDF so I can exempt the text that is on top of the images.
For example:
Where the numbers on the graph are unwanted and need to be removed from the extracted text.
I'm just not sure how to:
A) Locate all the images and store the coordinates of where it starts and ends
B) Ignore the text that is on top of the images in the PDF document
(I am using iTextSharp to try and achieve this, but so far I am not having much luck)
I'm not exactly sure how iTextSharp works but the PostScript language reference or the PDF Reference manuals may be a good place to start figuring out what you need to know.
I just cracked open a PDF file in a text editor to check out the format because I haven't seen it in a while and then realized what the problem might be.
PDFs support "Images", and "Stream Objects" which can contain image data. Stream objects actually declare enough information that you can know where they begin and end and write something to manually ignore them.
A Stream Object Header looks like this:
<</Intent/RelativeColorimetric/Subtype/Image/Length 19678/Filter/DCTDecode/Name/X/Metadata 4314 0 R/BitsPerComponent 8/ColorSpace 5247 0 R/Width 290/Height 372/Type/XObject>>stream
It's entirely possible that your particular PDF has only one "Image" and then the rest of it is "Streams".
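As a rough sketch of what "write something to manually ignore them" could look like, here is a small Python function that pulls a few fields out of a header like the one above. The function name image_stream_info is made up for the example, and it assumes /Length is a direct number; in many files it is an indirect reference such as "12 0 R", which this does not handle.

```python
import re

# The sample header from above, as raw bytes.
HEADER = (b"<</Intent/RelativeColorimetric/Subtype/Image/Length 19678"
          b"/Filter/DCTDecode/Name/X/Metadata 4314 0 R/BitsPerComponent 8"
          b"/ColorSpace 5247 0 R/Width 290/Height 372/Type/XObject>>stream")


def image_stream_info(header):
    """Return Length/Width/Height for an image stream header,
    or None if the dictionary is not marked /Subtype /Image."""
    if not re.search(rb"/Subtype\s*/Image\b", header):
        return None
    info = {}
    for key in (b"Length", b"Width", b"Height"):
        match = re.search(rb"/" + key + rb"\s+(\d+)", header)
        if match:
            info[key.decode()] = int(match.group(1))
    return info
```

The Length value tells you how many bytes after the stream keyword belong to the image data, so a scanner can skip over them.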
I suggest cracking it open to take a look. It would also be beneficial if you included some sample code showing how you're using the library.
By opening a PDF in a text editor I also found the string /Type /Page, which seems to mark each new page, so there's a chance you could count those to determine which page you're currently on.
The header at the top of the document I'm reviewing is %PDF-1.2 and the latest version is 1.7, so there may be some disparity here because of that.
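A minimal sketch of that page-counting idea in Python, assuming the page dictionaries are stored uncompressed in the file (newer PDFs can pack them into compressed object streams, where a plain byte search will not see them). The function name count_page_objects is hypothetical.

```python
import re


def count_page_objects(raw):
    """Count '/Type /Page' entries in raw PDF bytes, skipping '/Type /Pages'
    (the page tree nodes) via the negative lookahead."""
    return len(re.findall(rb"/Type\s*/Page(?![A-Za-z])", raw))
```

For real use you would pass it the file contents, e.g. open(path, "rb").read().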
Any chance you can share the PDF file you're working with?
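Once the image rectangles are known, part B of the question comes down to a containment test. Here is a minimal Python sketch, independent of iTextSharp; the Rect class, the (text, x, y) chunk format and the function name are all assumptions for the example, with coordinates in PDF user space (origin at the bottom left).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    bottom: float
    right: float
    top: float

    def contains(self, x, y):
        """True if the point lies inside the rectangle or on its edge."""
        return self.left <= x <= self.right and self.bottom <= y <= self.top


def text_outside_images(chunks, image_rects):
    """Keep the text of every (text, x, y) chunk whose position
    does not fall inside any of the image rectangles."""
    return [text for text, x, y in chunks
            if not any(rect.contains(x, y) for rect in image_rects)]
```

With one image at Rect(100, 100, 300, 300), a chunk at (150, 200) would be dropped while a chunk at (50, 700) would be kept.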
I've been tasked with improving the way we create custom brochures. The old way, a legacy system created the needed PDFs, and then I would download those and "glue" them into one big PDF.
The new way we want to go about this is to skip the legacy system and build all of these things from our new system.
The biggest hurdle is the cover, which consists of the background layer and the logo layer, which has the company logo, a shadowbox and an emblem. All of these objects are PDF documents.
My problem is: after I build the logo portion, how can I position that PDF exactly where I need it on the background layer?
This is all being done on the fly, so I can't save anything to disk.
Any help will be greatly appreciated.
There are several PDFsharp samples that show how to do it:
http://pdfsharp.net/wiki/XForms-sample.ashx
http://pdfsharp.net/wiki/Graphics-sample.ashx#Draw_a_form_XObject_a_page_from_an_external_PDF_file_27
http://pdfsharp.net/wiki/TwoPagesOnOne-sample.ashx
You can draw pages from other PDF files like images on a newly created PDF page. You can specify the exact positions and sizes, you can even transform them (skew them, rotate them).
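For what it's worth, the arithmetic behind "exact positions and sizes" is just an affine matrix. The Python sketch below is not PDFsharp API; it assumes the source page's box starts at (0,0), uses native PDF coordinates (origin at the bottom left), and the function names are made up for the example.

```python
def placement_matrix(src_w, src_h, x, y, width, height):
    """Matrix (a, b, c, d, e, f) that scales a src_w x src_h page
    into a width x height rectangle whose lower-left corner is (x, y)."""
    return (width / src_w, 0.0, 0.0, height / src_h, x, y)


def apply(matrix, px, py):
    """Transform a point of the source page into target page coordinates."""
    a, b, c, d, e, f = matrix
    return (a * px + c * py + e, b * px + d * py + f)
```

For example, placing a 200 x 100 logo page at (50, 60) at half size gives the matrix (0.5, 0, 0, 0.5, 50, 60), and its top-right corner ends up at (150, 110).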
I have a fairly complicated existing PDF file. It contains a barcode and a fancy-looking table.
Based on it, I have to create an application that generates PDFs which look the same but contain different records in the table and a different barcode.
Is it possible to copy the existing PDF and just change the content of the barcode and table?
What would be the best approach to create a PDF that looks the same but has different content?
Thank you very much for your help.
If the barcode and table are static, I would open the file in Photoshop or Illustrator, delete everything I don't want, and save it as a PDF again. Then I would follow the guide "iText - add content to existing PDF file" and use the cleaned file as a template to put my custom content in.
If the table and barcode are dynamically generated (each one is different) and you need to crop out content on the fly, I would pull some hacky crap and draw white squares over all the content I want gone, then proceed to use it as a template.
Just my 2 cents given the information provided.
I am generating PDF documents using DevExpress XtraReports.
I am using the same image over and over (in rows of status lights).
The PDF generated seems to duplicate the image definition for each image included. I would prefer if it included the image once and referenced it wherever it needed another copy - this would drastically reduce the size of my PDF docs.
Is there any way to achieve this using DevExpress, or even by post-processing with a third-party application? Any help is appreciated.
Two options:
OPT1: I suppose your image is a background or a company logo and the image is the same on all the pages of the PDF. If so, create the PDF without the image, then post-process the PDF and add the image to all the pages (you can do that using iText/iTextSharp or PDFlib).
OPT2: Take your actual PDF and convert it using Ghostscript. With Ghostscript you can do a "PDF to PDF" conversion. During the conversion Ghostscript tries to identify repeated images and removes them, so the resulting file is smaller. (Ghostscript is not always able to do that... try it with your PDF file.)
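A minimal sketch of the "PDF to PDF" conversion from OPT2, driven from Python. It assumes Ghostscript is installed and that the executable is called gs (on Windows it is usually gswin64c); the function names are made up for the example. As far as I know, -dDetectDuplicateImages is already on by default for the pdfwrite device, so it is only spelled out here to make the intent explicit.

```python
import subprocess


def ghostscript_rewrite_args(src, dst, gs="gs"):
    """Build the argument list for rewriting src into dst with pdfwrite."""
    return [
        gs,
        "-sDEVICE=pdfwrite",             # write a new PDF
        "-dNOPAUSE", "-dBATCH",          # run non-interactively
        "-dDetectDuplicateImages=true",  # share identical images
        "-sOutputFile=" + dst,
        src,
    ]


def rewrite_pdf(src, dst, gs="gs"):
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(ghostscript_rewrite_args(src, dst, gs), check=True)
```

Compare the file sizes before and after; whether the duplicates are actually merged depends on the input file.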
It is possible to re-use the same image content in multiple locations throughout your document. But it's a fair bit easier to do this while adding the image(s) to the PDF.
I'm not sure if DevExpress supports this.