How to determine where the content ends in pdf using PdfSharp

I am creating a PDF file using PdfSharp.
I have a fixed, existing PDF file with one page and some content in it. I want to start adding my own content right after the existing content.
Is there anything in PdfSharp that lets me determine where the last element or section ends?
Any reference would be great.

Please see my answer to the question "How to add DocumentLink to existing PDF file using PdfSharp". You can take the AddDocumentLink code from that answer and use it to get the highest X and Y coordinates; that should be enough to figure out where to append additional content.
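To illustrate the "highest coordinate" idea, here is a minimal, library-independent sketch in Python. It assumes you have already collected a bounding rectangle for each existing element (how you collect them depends on the code from the linked answer, which is not reproduced here). The `Rect` layout, the top-left origin, and the name `content_end_y` are assumptions made for this example, not part of PdfSharp.

```python
from typing import Iterable, Tuple

# (x, y, width, height), with the origin at the top-left of the page and
# y growing downwards. This orientation is an assumption; if your
# coordinates have a bottom-left origin, look for the minimum y instead.
Rect = Tuple[float, float, float, float]


def content_end_y(rects: Iterable[Rect], margin: float = 0.0) -> float:
    """Return the y position just below the lowest existing element."""
    bottoms = [y + height for (_x, y, _width, height) in rects]
    # An empty page has no content, so new content can start at the top.
    return (max(bottoms) if bottoms else 0.0) + margin


# Hypothetical rectangles for three existing elements.
existing = [(40, 50, 200, 20), (40, 80, 300, 120), (40, 210, 150, 15)]
next_y = content_end_y(existing, margin=10)
```

New content would then be drawn starting at `next_y`.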

Related

Skip section of PDF when extracting text and Get image coordinates

Currently, I can extract all the text chunks with their location data from a PDF. The problem is that the PDF contains images with text annotations which I do not want included in the extraction.
However, for whatever reason whenever I search the PDF for images, it only finds 1 of the images and usually throws the exception: The colour space is not supported. It's as if it doesn't recognise them as images?
I am not wishing to extract the images, just locate where they start and end in relation to the PDF so I can exempt the text that is on top of the images.
For example:
Where the numbers on the graph are unwanted and need to be removed from the extracted text.
I'm just not sure how to:
A) Locate all the images and store the coordinates of where it starts and ends
B) Ignore the text that is on top of the images in the PDF document
(I am using iTextSharp to try and achieve this, but so far I am not having much luck)
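Step B can be sketched independently of the PDF library, once the image rectangles from step A are known. The following Python sketch assumes each text chunk and each image comes with a bounding box in the same page coordinate system; the `Box` type and the function names are hypothetical and not part of iTextSharp.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned rectangle in page coordinates (bottom-left origin assumed)."""
    left: float
    bottom: float
    right: float
    top: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share any area."""
    return (a.left < b.right and b.left < a.right
            and a.bottom < b.top and b.bottom < a.top)


def text_outside_images(chunks: List[Tuple[str, Box]],
                        images: List[Box]) -> List[Tuple[str, Box]]:
    """Keep only the text chunks that do not overlap any image rectangle."""
    return [(text, box) for text, box in chunks
            if not any(overlaps(box, image) for image in images)]


# Hypothetical data: one graph image and two text chunks.
graph = Box(100, 300, 400, 500)
chunks = [("Quarterly report", Box(100, 700, 300, 720)),
          ("42", Box(150, 350, 165, 360))]  # axis label drawn over the graph
kept = text_outside_images(chunks, [graph])
```

Only "Quarterly report" survives here, because the "42" label lies inside the graph's rectangle.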

I'm not exactly sure how iTextSharp works but the PostScript language reference or the PDF Reference manuals may be a good place to start figuring out what you need to know.
I just cracked open a PDF file in a text editor to check out the format because I haven't seen it in a while and then realized what the problem might be.
PDFs support "Images", and "Stream Objects" which can contain image data. Stream objects actually declare enough information that you can know where they begin and end and write something to manually ignore them.
A Stream Object Header looks like this:
<</Intent/RelativeColorimetric/Subtype/Image/Length 19678/Filter/DCTDecode/Name/X/Metadata 4314 0 R/BitsPerComponent 8/ColorSpace 5247 0 R/Width 290/Height 372/Type/XObject>>stream
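As a rough illustration of reading such a header by hand, here is a Python sketch that pulls a few numeric entries out of the dictionary above with regular expressions. It is not a real PDF parser: it assumes the dictionary is uncompressed, sits on one line, and stores the values as direct integers (an indirect reference such as `12 0 R` is skipped). The function name is made up for this example.

```python
import re
from typing import Dict, Optional

HEADER = (b"<</Intent/RelativeColorimetric/Subtype/Image/Length 19678"
          b"/Filter/DCTDecode/Name/X/Metadata 4314 0 R/BitsPerComponent 8"
          b"/ColorSpace 5247 0 R/Width 290/Height 372/Type/XObject>>stream")


def describe_image_dict(header: bytes) -> Optional[Dict[str, int]]:
    """Return Length/Width/Height if the dictionary describes an image."""
    if not re.search(rb"/Subtype\s*/Image\b", header):
        return None  # some other kind of stream
    info = {}
    for key in (b"Length", b"Width", b"Height"):
        # \b stops the number from being cut short; the lookahead skips
        # indirect references of the form "<obj> <gen> R".
        match = re.search(rb"/" + key + rb"\s+(\d+)\b(?!\s+\d+\s+R)", header)
        if match:
            info[key.decode("ascii")] = int(match.group(1))
    return info


image_info = describe_image_dict(HEADER)
```

With `Length` known, the image data is the `Length` bytes that follow the `stream` keyword and its end-of-line marker, which is what makes it possible to skip over.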
It's entirely possible that your particular PDF has only one "Image" and then the rest of it is "Streams".
I suggest cracking it open to take a look. It would also help if you included some sample code showing how you're using the library.
By opening a PDF in a text editor I also found the string /Type /Page, which seems to mark new pages, so there's a chance you could count those to determine which page you're currently on.
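Counting those markers could look like the Python sketch below. It is only a rough heuristic under the assumption that the page objects are stored uncompressed in the file; in PDF 1.5 and later they can live inside compressed object streams, where a raw byte scan will not see them. The pattern deliberately skips /Type /Pages, which is the page tree node rather than a page.

```python
import re

# Matches "/Type /Page" or "/Type/Page", but not "/Type /Pages".
PAGE_OBJECT = re.compile(rb"/Type\s*/Page(?![A-Za-z])")


def count_page_objects(raw_pdf: bytes) -> int:
    """Count uncompressed page dictionaries in raw PDF bytes."""
    return len(PAGE_OBJECT.findall(raw_pdf))


# Hypothetical fragment: one page tree node and two pages.
sample = (b"1 0 obj <</Type /Pages /Count 2 /Kids [2 0 R 3 0 R]>> endobj\n"
          b"2 0 obj <</Type /Page /Parent 1 0 R>> endobj\n"
          b"3 0 obj <</Type/Page/Parent 1 0 R>> endobj\n")
page_count = count_page_objects(sample)
```

For a real file you would pass in the result of `open(path, "rb").read()`.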
The header at the top of the document I'm reviewing is %PDF-1.2 and the latest version is 1.7, so there may be some disparity here because of that.
Any chance you can share the PDF file you're working with?

itext create pdf based on existing one with changed content

I have a fairly complicated, finished PDF file. It contains a barcode and a fancy-looking table.
Based on it, I have to create an application that generates PDFs which look the same but contain different records in the table and a different barcode.
Is it possible to copy the existing PDF and just change the content of the barcode and table?
What would be the best approach to create a PDF that looks the same but has different content?
Thank you very much for your help.

If the barcode and table are static, I would open the file in Photoshop or Illustrator, delete everything I don't want, and then save it as a PDF again. Then follow the guide "iText - add content to existing PDF file" and use the file as a template to put my custom content in.
If the table and barcode are dynamically generated (each one is different) and you need to crop out content on the fly, I would pull some hacky crap and draw white squares over all the content I want gone, then proceed to use the result as a template.
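At the PDF level, a "white square" is just a filled rectangle appended to the page's content stream. The Python sketch below only builds the operator string for one such rectangle; how you append it to a page depends on your library and is not shown. The function name is made up for this example, and the coordinates are in PDF user space (points, bottom-left origin by default).

```python
def white_rectangle_ops(x: float, y: float, width: float, height: float) -> str:
    """Return PDF content-stream operators that fill a white rectangle.

    q / Q  save and restore the graphics state
    rg     set the non-stroking (fill) colour in RGB; 1 1 1 is white
    re     append a rectangle to the current path
    f      fill the path
    """
    return f"q 1 1 1 rg {x:g} {y:g} {width:g} {height:g} re f Q"


# Hypothetical position of the old barcode on the page.
cover_barcode = white_rectangle_ops(50, 600, 200, 80)
```

Keep in mind that this only hides the old content visually. The original barcode and table are still in the file and can still be extracted.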
Just my 2 cents given the information provided.

VSTO: Is there any way to retrieve the original file name of a picture (InlineShape) that has been inserted into a document?

I am developing a Word add-in. One piece of its functionality needs to retrieve the original location of a picture that has been inserted into a document.
It doesn't matter if the Image file no longer exists in the original directory. I will handle that in the code.

I think there is no way to do this. I had the same requirement to find the file name from the image in the document, so I had to insert the image with the file name in its alternative text description to achieve this.

The question got me curious, so I tried the following: add an image to a Word document, save it, unzip it (a .docx file is a zip archive) and start looking into the XML. The media folder contains the image as embedded in Word, which at that point has been renamed and has "forgotten" its origin. On the other hand, document.xml does contain a lot of information about the image enclosed in the tag, and that includes the whole path to the original picture.
I don't know if the Open XML SDK gives you direct access to this (I doubt it), but in the worst case you should be able to get to it by digging into the file, assuming you are working with an already-saved file.
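Digging into a saved file could look roughly like the Python sketch below. The answer does not say which element holds the path, so this assumes it is the `wp:docPr` drawing-properties element, whose `name` and `descr` attributes carry the picture name and alternative text; check your own document.xml to confirm. The function name is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the wp: prefix used for inline drawings in document.xml.
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def picture_properties(docx):
    """Return the attributes of every wp:docPr element in a saved .docx.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [dict(element.attrib) for element in root.iter(f"{{{WP_NS}}}docPr")]
```

Calling `picture_properties("report.docx")` (a hypothetical file name) returns one dictionary per picture, which you can inspect for a `descr` or `name` value that looks like a file path.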
If the file is not saved yet, I don't know.

I know this is years old, but the full path of an image that has been drag&dropped into a document is available in the AlternativeText field of the InlineShape. Unfortunately you cannot get this value when it has been inserted with Insert Picture. Images that have been pasted probably vary on whether this is available, e.g. if it was pasted from a document where it was drag&dropped it's probably there, but otherwise it isn't.
This info comes from targeting Word 2010 with VSTO.

iTextSharp or XSL-FO to create a PDF dynamically with fillable forms?

This is my first stackoverflow question.
After days of research, I am still lost on how this can be done, if it's even possible.
I am trying to create a PDF document using either iTextSharp or XSL-FO (FO.NET is what I am using currently). Creating the document is no problem. I need this document to have fields that the user can still fill in.
I am aware of the ability to create a PDF form using acrobat, then using iTextSharp to fill in those fields. This can then be saved and the user can open the document and edit it.
The problem with this is, anytime the PDF "template" needs to be changed, someone has to edit the PDF document, then change the backend logic to handle the new field.
I am looking for a 100% dynamic solution.
Ideally I would use XSL and FO to create this document without the need for an existing PDF document. I have found no way to create a fillable form using FO.NET, or even iTextSharp, without already having an existing PDF "template".
Thanks in advance.

I believe both the RenderX XEP and Antenna House FO processors support PDF Forms. They aren't free and additional output modules may be required for PDF Forms.

Consolidate the same image used multiple times in a PDF

I am generating PDF documents using DevExpress XtraReports.
I am using the same image over and over (in rows of status lights).
The PDF generated seems to duplicate the image definition for each image included. I would prefer if it included the image once and referenced it wherever it needed another copy - this would drastically reduce the size of my PDF docs.
Is there any way to achieve this using DevExpress, or even by post-processing with a third-party application? Any help is appreciated.

Two options:
OPT1: I suppose your image is a background or a company logo and is the same on all the pages of the PDF. If so, create the PDF without the image, then post-process the PDF and add the image to all the pages (you can do that using iText/iTextSharp or PDFlib).
OPT2: Take your actual PDF and convert it using Ghostscript. Ghostscript can do a "PDF to PDF" conversion. During the conversion, Ghostscript tries to identify repeated images and removes the duplicates, so the resulting file is smaller. (Ghostscript is not always able to do that; try it with your PDF file.)

It is possible to re-use the same image content in multiple locations throughout your document. But it's a fair bit easier to do this while adding the image(s) to the PDF.
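The underlying idea, storing each distinct image once and pointing every use at that single copy, can be sketched in Python as below. This is only the bookkeeping step, under the assumption that you can already get the raw bytes of each image object out of the PDF; rewriting the references inside the file is library-specific and not shown. The function name and the object numbers are hypothetical.

```python
import hashlib
from typing import Dict


def share_identical_images(images: Dict[int, bytes]) -> Dict[int, int]:
    """Map every image object number to the first object with identical bytes."""
    first_seen: Dict[str, int] = {}
    remap: Dict[int, int] = {}
    for object_number, data in images.items():
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the first object seen for each distinct image.
        remap[object_number] = first_seen.setdefault(digest, object_number)
    return remap


# Hypothetical status-light images keyed by PDF object number.
lights = {10: b"green-light", 11: b"red-light", 12: b"green-light", 13: b"green-light"}
remap = share_identical_images(lights)
```

Objects 12 and 13 map to object 10 here, so only two image streams would need to be kept.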
I'm not sure if DevExpress supports this.
