I want to copy certain elements from one PDF to another using iTextSharp.
I want to read one PDF, extract its text elements, correct them, and create a new PDF from the updated text plus all the images and other content of the first PDF.
How can this be achieved?
This task is very complex. I wrote a program to do this for a large greeting card maker.
First you have to locate the text and calculate the glyph bounding boxes. Next you have to modify the content stream to remove the text. The text may be broken into many pieces, depending on the PDF creator. You have to remove those operators from the content stream and adjust the CTM, because some operators use relative positioning. Finally, you have to insert the replacement text, matching the original text's style (font, size, color, orientation, etc.).
As for copying elements from one PDF to another, most of the steps above are required, plus you have to copy resources (e.g. fonts, colorspaces, patterns) to the new PDF.
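To illustrate why relative positioning matters, here is a minimal sketch in plain Java (no PDF library) that walks a toy content stream and records where each string is shown. The class TextRunLocator and its methods are hypothetical names made up for this example. It assumes only the BT, Td, Tj and ET operators matter and that strings are simple (...) literals without escapes or nesting; Tm, TD, T*, TJ, hex strings, glyph advance and the CTM are all ignored, so a real implementation needs far more.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds Tj strings and the line origin in effect when each is shown. */
final class TextRunLocator {

    /** One shown string plus the text-space line origin it was drawn from. */
    static final class TextRun {
        final String text;
        final double x;
        final double y;
        TextRun(String text, double x, double y) { this.text = text; this.x = x; this.y = y; }
    }

    static List<TextRun> locate(String content) {
        List<TextRun> runs = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        double x = 0, y = 0;
        for (String token : tokenize(content)) {
            if (token.equals("BT")) {
                x = 0; y = 0;                       // every text object starts at the origin
                operands.clear();
            } else if (token.equals("Td") && operands.size() >= 2) {
                // Td is relative: it offsets the start of the current line
                x += Double.parseDouble(operands.get(operands.size() - 2));
                y += Double.parseDouble(operands.get(operands.size() - 1));
                operands.clear();
            } else if (token.equals("Tj") && !operands.isEmpty()
                    && operands.get(operands.size() - 1).startsWith("(")) {
                String literal = operands.get(operands.size() - 1);
                runs.add(new TextRun(literal.substring(1, literal.length() - 1), x, y));
                operands.clear();
            } else if (token.equals("ET")) {
                operands.clear();
            } else {
                operands.add(token);                // operand, or an operator this sketch ignores
            }
        }
        return runs;
    }

    /** Splits on whitespace but keeps a (...) literal, parentheses included, as one token. */
    static List<String> tokenize(String content) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (c == '(') {
                int end = content.indexOf(')', i);
                if (end < 0) throw new IllegalArgumentException("unterminated string literal");
                i = end + 1;
            } else {
                while (i < content.length() && content.charAt(i) != '('
                        && !Character.isWhitespace(content.charAt(i))) {
                    i++;
                }
            }
            tokens.add(content.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String stream = "BT /F1 12 Tf 72 700 Td (Hello) Tj 0 -14 Td (World) Tj ET";
        for (TextRun run : locate(stream)) {
            System.out.println(run.text + " at (" + run.x + ", " + run.y + ")");
        }
    }
}
```

The second string has no absolute position of its own: its location only exists as the sum of both Td offsets. That is why deleting an operator in the middle of a text object shifts everything after it unless the positions are recomputed.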
Related
I recently started a project using PDFsharp + MigraDoc and ran into a problem I have seen in other posts, for which there is no automatic fix. A table row is moved to the next page if it doesn't fit in the remaining space, and if it still doesn't fit there, it runs into the border and the text is lost. I am thinking of a workaround, but I am not sure exactly how it can be done.
My thinking is as follows:
If I can check how many lines of text fit in a cell for a given string, I can keep a counter and increase it every time I add text. With the excess text I can simply create a new row (which will automatically be placed on the next page), which fixes my problem. Even if I am not counting lines, is there a way to check whether the row has become too large for the current page? If I can detect that the cell is too large and will be moved to the next page, I can trim the string to the point where it fits, keep the remaining words that didn't fit, and make full use of the space on the page.
This is how the document is currently generated:
Is there a way to work around this? That white space is useless and a waste of resources in a 30-40 page document.
One extreme option: Make the layout in your code and use PDFsharp to draw the text.
See also:
https://forum.pdfsharp.net/viewtopic.php?f=8&t=3192
A MigraDoc cell can contain a mix of different fonts with different font attributes (regular, bold, ...) and sizes. Measuring the size and creating a new row can become complicated if you mix different fonts, but it can be simple if you only use a single font for your cell.
See also:
https://forum.pdfsharp.net/viewtopic.php?f=8&t=3196
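For the single-font case, counting lines comes down to a greedy word wrap against the cell width. The sketch below is plain Java and library-agnostic; CellWrapper and wrap are hypothetical names, and the measure function is an assumption standing in for whatever your library uses to measure a string in the cell's font (with PDFsharp that would presumably be a string-measuring call on the graphics object). It also assumes a single word never needs to be broken.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper: greedy word wrap for a cell that uses one font. */
final class CellWrapper {

    /**
     * Breaks text into lines no wider than maxWidth.
     * measure returns the rendered width of a string in the cell's font (an assumption).
     */
    static List<String> wrap(String text, double maxWidth, ToDoubleFunction<String> measure) {
        List<String> lines = new ArrayList<>();
        String line = "";
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;                            // only happens for blank input
            }
            String candidate = line.isEmpty() ? word : line + " " + word;
            if (!line.isEmpty() && measure.applyAsDouble(candidate) > maxWidth) {
                lines.add(line);                     // current line is full, start a new one
                line = word;
            } else {
                line = candidate;
            }
        }
        if (!line.isEmpty()) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in metric: every character is 5 units wide.
        List<String> lines = wrap("one two three four five", 45.0, s -> s.length() * 5.0);
        System.out.println(lines.size() + " lines: " + lines);
    }
}
```

Once you know how many lines fit in the remaining space, the first lines of the result go into the current row and the rest into a new row.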
The space problem with tables occurs if table rows are rather large (more than just one or two lines of text). Maybe tables are not the best option to present the information. How strict are your requirements? Can you get away from tables?
The solution that finally worked was as follows:
Set up the style for the document, including the header.
Depending on the data used, create a for loop that adds the desired rows to the table.
At the top of the loop, add a row to the table.
Keep a variable with the number of pages the document currently contains (initialise it to 1 before entering the loop).
Clone the document, render the clone, and compare its page count with the stored one. If the clone has more pages, the row you just added overflows the current page. I achieved this by rendering the document every time I added a new row.
An inner loop is needed to trim the text within the row. I split the text into sentences; if the row contains more than 3 sentences I trim it, otherwise I just let it go to the next page.
Make sure you always delete the last row in the inner loop, otherwise you will end up with duplicated data.
It might not be the most efficient way, but it renders 30+ page documents with tables in under 2 seconds on Azure servers. I hope this helps someone at some point.
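The steps above can be sketched roughly as follows. This is only one reading of the description, written in plain Java without MigraDoc. RowSplitter, layout and simulatedPageCount are hypothetical names; the pageCount function is an assumption standing in for "clone the document, render it, and read the page count", and simulatedPageCount is a toy replacement that treats every sentence as one line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical sketch of the add-row, render, trim loop. */
final class RowSplitter {

    static List<String> splitSentences(String text) {
        return new ArrayList<>(Arrays.asList(text.split("(?<=[.!?])\\s+")));
    }

    /** pageCount stands in for rendering a clone of the document and counting its pages. */
    static List<String> layout(List<String> texts, ToIntFunction<List<String>> pageCount) {
        List<String> rows = new ArrayList<>();
        int pages = 1;                                   // page count before the loop
        for (String text : texts) {
            rows.add(text);                              // add the row first
            int now = pageCount.applyAsInt(rows);        // then "render" and compare
            List<String> sentences = splitSentences(text);
            if (now > pages && sentences.size() > 3) {
                List<String> overflow = new ArrayList<>();
                // Inner loop: move trailing sentences out until the row fits again.
                while (now > pages && sentences.size() > 1) {
                    rows.remove(rows.size() - 1);        // always delete the last row first
                    overflow.add(0, sentences.remove(sentences.size() - 1));
                    rows.add(String.join(" ", sentences));
                    now = pageCount.applyAsInt(rows);
                }
                rows.add(String.join(" ", overflow));    // remainder goes to the next page
                now = pageCount.applyAsInt(rows);
            }
            pages = now;                                 // short rows simply move on
        }
        return rows;
    }

    /** Toy renderer: one sentence is one line, and a row never splits across pages. */
    static ToIntFunction<List<String>> simulatedPageCount(int linesPerPage) {
        return rows -> {
            int pages = 1;
            int used = 0;
            for (String row : rows) {
                int lines = splitSentences(row).size();
                if (used + lines > linesPerPage) {
                    pages++;
                    used = 0;
                }
                used += lines;
            }
            return pages;
        };
    }

    public static void main(String[] args) {
        List<String> rows = layout(Arrays.asList("A. B. C.", "D. E. F. G."), simulatedPageCount(5));
        System.out.println(rows);
    }
}
```

Rendering after every row is the expensive part; the sketch calls pageCount once per added row and once per trimmed sentence, which matches the "not the most efficient" caveat.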
Based on a condition, I need to hide one section, and the section below it should move up, so that the hidden section does not show as a blank area in the generated PDF.
Some clarification:
If you are doing this with an existing PDF, it is not likely to work. PDF documents are not a WYSIWYG format. Think of them more as containers of drawing instructions than as containers of text.
Moving a section of an existing document will not work because:
the document itself contains no information on what instructions go together to make up lines, paragraphs, and sections
the document uses compression and byte offsets; moving or deleting part of it means you need to recalculate all the byte offsets
If you drop the requirement of re-flowing the text, it is certainly possible. iText already has an add-on for that called pdfSweep, which looks at all the drawing and rendering operations and removes the ones that intersect with a given rectangle (or adjusts them, for instance when a path goes through the rectangle).
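If you are generating the PDF yourself, this is of course trivial. You can simply do something like:
import java.io.File;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Paragraph;
File outputFile = new File(System.getProperty("user.home"), "output.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfWriter(outputFile));
Document layoutDocument = new Document(pdfDocument);
// Add the optional section only when the condition holds; otherwise the next paragraph moves up.
if (some_condition) {
    layoutDocument.add(new Paragraph("Lorem Ipsum Dolor Sit Amet"));
}
layoutDocument.add(new Paragraph("Never gonna give you up. Never gonna let you down."));
layoutDocument.close(); // flushes the layout and writes the file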
Check out http://itextpdf.com/itext7/pdfsweep
I have a piece of code that stamps a PDF by merging a FormXObject from a source PDF file. What I am trying to do is superimpose some text onto that. How does one approach such an operation?
Add an image or merge a PDF to an existing document.
Add a text relative to that image/pdf.
Concatenate the text and image into one selectable annotation (ideally a stamp)
I can do step one individually, but superimposing the text is what remains unclear. Think of it as trying to fake a signature using a stamp that contains modifiable text (similar to Adobe's Signature Appearance).
I need to extract text from a pdf document and I am using the iTextSharp library to do so. The issue is that the image has text on it which is not part of the image. I have been looking to find a way to get the coordinates of the image as there are annotations on the image being included in the text extraction:
e.g.
Results in the extraction:
Some text...
Text
Text
More Text..
However, the text in the image is irrelevant and needs to be ignored to give the output:
Some text...
More text...
Another issue is that this happens on multiple pages, with images of different sizes. However, the text is always within the image bounds, so if I can determine the height and the x, y coordinates of each image relative to the page, I can extract the necessary data.
Currently, I need to just get the text but I will also need to extract the images at a later date.
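One way to approach this, once the coordinates are known, is a plain containment test: drop every text chunk whose bounding box lies inside an image's bounding box. The sketch below uses only java.awt.geom and is not iTextSharp code; ImageTextFilter, Chunk and textOutsideImages are hypothetical names. It assumes you have already collected page-space boxes for the text chunks and the images (in iTextSharp these would presumably come from a render listener's text and image callbacks, which is not shown here).

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: keeps only text that is not drawn on top of an image. */
final class ImageTextFilter {

    /** A piece of extracted text with its bounding box in page coordinates. */
    static final class Chunk {
        final String text;
        final Rectangle2D box;
        Chunk(String text, Rectangle2D box) { this.text = text; this.box = box; }
    }

    static List<String> textOutsideImages(List<Chunk> chunks, List<Rectangle2D> imageBoxes) {
        List<String> kept = new ArrayList<>();
        for (Chunk chunk : chunks) {
            boolean insideImage = false;
            for (Rectangle2D image : imageBoxes) {
                // contains() is true only when the whole chunk box lies within the image box
                if (image.contains(chunk.box)) {
                    insideImage = true;
                    break;
                }
            }
            if (!insideImage) {
                kept.add(chunk.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Rectangle2D> images = new ArrayList<>();
        images.add(new Rectangle2D.Double(100, 400, 200, 150));    // x, y, width, height
        List<Chunk> chunks = Arrays.asList(
                new Chunk("Some text...", new Rectangle2D.Double(72, 700, 100, 12)),
                new Chunk("Text", new Rectangle2D.Double(150, 480, 30, 12)),
                new Chunk("Text", new Rectangle2D.Double(150, 450, 30, 12)),
                new Chunk("More Text..", new Rectangle2D.Double(72, 300, 100, 12)));
        System.out.println(textOutsideImages(chunks, images));
    }
}
```

If some labels only partly overlap the image, intersects() may be the better test than contains(). The same image boxes can be reused later when the images themselves have to be extracted.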
I can recognize the characters in a TIFF image using the MODI OCR engine, but for one specific text format it does not read the exact characters and returns some unknown characters instead.
Suppose this is the text format in a TIFF image file. How can I read the characters in the image below clearly?
Is there any way to recognize and display the exact characters?
Image analysis and OCR is always kind of a soft science, since it might work on one instance and fail in another.
Can you apply some filters before performing the OCR? You might try to blur the image beforehand to soften the impact of the dotted background and then perform OCR on the image.
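As a rough illustration of the blur step, here is a box blur using only the standard java.awt.image classes. OcrPreprocess and boxBlur are hypothetical names, and the kernel size is an assumption you would need to tune against your images. The OCR call itself is not shown, and some image types (for example palette-based TIFFs) may need converting to RGB first.

```java
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;
import java.util.Arrays;

/** Hypothetical helper: softens an image before it is handed to the OCR engine. */
final class OcrPreprocess {

    /** Averages each pixel with its size x size neighbourhood (box blur). */
    static BufferedImage boxBlur(BufferedImage source, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        float[] weights = new float[size * size];
        Arrays.fill(weights, 1.0f / (size * size));      // equal weights that sum to 1
        // EDGE_NO_OP copies border pixels unchanged instead of darkening them.
        ConvolveOp op = new ConvolveOp(new Kernel(size, size, weights), ConvolveOp.EDGE_NO_OP, null);
        return op.filter(source, null);                  // returns a new image
    }

    public static void main(String[] args) {
        // A single white dot on black, standing in for one dot of a dotted background.
        BufferedImage image = new BufferedImage(5, 5, BufferedImage.TYPE_INT_RGB);
        image.setRGB(2, 2, 0xFFFFFF);
        BufferedImage blurred = boxBlur(image, 3);
        System.out.println("centre brightness after blur: " + (blurred.getRGB(2, 2) & 0xFF));
    }
}
```

The blur spreads each background dot over its neighbours, so the dots fade while the larger character strokes stay visible. Too large a kernel will also smear the characters, so it is worth trying a few sizes.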