Getting PDF page length - c#

My articles are formatted as PDF files, and one or more pages in them may be blank. I want to detect those pages and remove them from the PDF file. My idea is that pages smaller than 60 KB are probably empty, so if I can get the size of each page I can detect them.
I tried this:
var reader = new PdfReader("D:\\_test\\file.pdf");
/*
 * With reader.FileLength I can get the size of the whole PDF file,
 * but I don't know how to get the size of each page...
 */
for (var i = 1; i <= reader.NumberOfPages; i++)
{
    /*
     * MessageBox.Show(???);
     */
}

I would do this in 2 steps:
1. First, go over the document using an IEventListener to detect which pages are empty.
2. Once you've determined which pages are empty, create a new document by copying the non-empty pages from the source document into it.
Step 1:
List<Integer> emptyPages = new ArrayList<>();
PdfDocument pdfDocument = new PdfDocument(new PdfReader(new File(SRC)));
// page numbers are 1-based, so the loop must include getNumberOfPages()
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++) {
    IsEmptyEventListener l = new IsEmptyEventListener();
    new PdfCanvasProcessor(l).processPageContent(pdfDocument.getPage(i));
    if (l.isEmptyPage()) {
        emptyPages.add(i);
    }
}
Then you need a proper implementation of IsEmptyEventListener, which may be tricky and will depend on your specific document(s). This is a demo:
class IsEmptyEventListener implements IEventListener {
private int eventCount = 0;
public void eventOccurred(IEventData data, EventType type){
// perhaps count only text rendering events?
eventCount++;
}
public boolean isEmptyPage(){ return eventCount < 32; }
}
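One way to read the "count only text rendering events" hint is sketched below. This is a standalone illustration rather than iText code: the Kind enum is a stand-in that only borrows a few iText EventType names, and the idea of counting text and image events against a caller-chosen minimum is an assumption that would need tuning per document.

```java
import java.util.List;

final class EmptyPageHeuristic {
    // Stand-in for iText's EventType; only the names are borrowed (assumption).
    enum Kind { BEGIN_TEXT, RENDER_TEXT, END_TEXT, RENDER_IMAGE, RENDER_PATH, CLIP_PATH_CHANGED }

    // A page counts as empty when fewer than minVisible events draw text or images.
    static boolean isEmpty(List<Kind> events, int minVisible) {
        int visible = 0;
        for (Kind k : events) {
            if (k == Kind.RENDER_TEXT || k == Kind.RENDER_IMAGE) {
                visible++;
            }
        }
        return visible < minVisible;
    }
}
```

With a minimum of 1, a page that only contains path or clipping events (for example a border frame) would be reported as empty, which may or may not be what you want.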
step 2:
Based on this example: https://developers.itextpdf.com/examples/stamping-content-existing-pdfs/clone-reordering-pages
void copyNonBlankPages(List<Integer> blankPages, PdfDocument src, PdfDocument dst) {
    int N = src.getNumberOfPages();
    List<Integer> toCopy = new ArrayList<>();
    // page numbers are 1-based, so the last page is N itself
    for (int i = 1; i <= N; i++) {
        if (!blankPages.contains(i)) {
            toCopy.add(i);
        }
    }
    src.copyPagesTo(toCopy, dst);
}
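The page selection in step 2 can be checked without any PDF library. The sketch below uses hypothetical names (PageSelection, pagesToKeep) and only shows the 1-based bookkeeping, in particular that the loop bound has to include the last page.

```java
import java.util.ArrayList;
import java.util.List;

final class PageSelection {
    // Returns the 1-based page numbers in 1..pageCount that are not listed as blank.
    static List<Integer> pagesToKeep(int pageCount, List<Integer> blankPages) {
        List<Integer> keep = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) { // <= so the last page is included
            if (!blankPages.contains(page)) {
                keep.add(page);
            }
        }
        return keep;
    }
}
```

For a 5-page document with blank pages 2 and 5, this yields the list 1, 3, 4, which is the kind of list the copy step expects.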

Related

Updating existing markup (FreeText Callout) PDF using itext7 .NET

I have the code below to update an existing markup (FreeText Callout) in a PDF using itext7 .NET. The updated annotation does not appear correctly, but once I edit it in Bluebeam the correct content is shown.
What am I missing?
public void UpdateMarkupCallout()
{
string inPDF = @"C:\in PDF.pdf";
string outPDF = @"C:\out PDF.pdf";
PdfDocument pdfDoc = new PdfDocument(new PdfReader(inPDF), new PdfWriter(outPDF));
int numberOfPages = pdfDoc.GetNumberOfPages();
for (int i = 1; i <= numberOfPages; i++)
{
PdfDictionary page = pdfDoc.GetPage(i).GetPdfObject();
PdfArray annotArray = page.GetAsArray(PdfName.Annots);
if (annotArray == null)
{
continue;
}
int size = annotArray.Size();
for (int x = 0; x < size; x++)
{
PdfDictionary curAnnot = annotArray.GetAsDictionary(x);
if (curAnnot.GetAsString(PdfName.Contents) != null)
{
string contents = curAnnot.GetAsString(PdfName.Contents).ToString();
if (contents != "" && contents.Contains("old content"))
{
curAnnot.Put(PdfName.Contents, new PdfString("new content"));
}
}
}
}
pdfDoc.Close();
}
The attached files: here
The answer is in Java but conversion to C# should be a matter of some easy letter case replacements and small tweaks.
Unfortunately, there is no silver bullet solution here, at least not without significant effort.
1. Partial proper solution
There are several issues here. First, you are only updating the /Contents key, while the annotations you are editing also have an /RC key, which stands for "A rich text string (see Adobe XML Architecture, XML Forms Architecture (XFA) Specification, version 3.3) that shall be used to generate the appearance of the annotation" (ISO 32000).
On top of that, the appearance (/AP entry) must be regenerated, as dictated by the specification. iText is not capable of doing this at the moment, so you will have to do it yourself.
You need to determine the area where the text must be drawn, taking the /RD (rect diff) entry into account.
To create your appearance you can use the pdfHTML add-on, which would process the rich text representation from /RC into layout elements that you can transfer to an XObject and put into /AP.
With the code similar to the following:
PdfDocument pdfDocument = new PdfDocument(new PdfReader("in PDF.pdf"),
new PdfWriter("out PDF.pdf"));
int numberOfPages = pdfDocument.getNumberOfPages();
for (int i = 1; i <= numberOfPages; i++) {
PdfDictionary page = pdfDocument.getPage(i).getPdfObject();
PdfArray annotArray = page.getAsArray(PdfName.Annots);
if (annotArray == null) {
continue;
}
int size = annotArray.size();
for (int x = 0; x < size; x++) {
PdfDictionary curAnnot = annotArray.getAsDictionary(x);
if (curAnnot.getAsString(PdfName.Contents) != null) {
String contents = curAnnot.getAsString(PdfName.Contents).toString();
if (!contents.isEmpty() && contents.contains("old content")) //set layer for a FreeText with this content
{
curAnnot.put(PdfName.Contents, new PdfString("new content"));
String richText = curAnnot.getAsString(PdfName.RC).toUnicodeString();
Document document = Jsoup.parse(richText);
for (Element element : document.select("p")) {
element.html("new content");
}
curAnnot.put(PdfName.RC, new PdfString(document.body().outerHtml()));
Rectangle bbox = curAnnot.getAsRectangle(PdfName.Rect);
Rectangle textBbox = bbox.clone();
// left, top, right, bottom
PdfArray rectDiff = curAnnot.getAsArray(PdfName.RD);
if (rectDiff != null) {
textBbox.applyMargins(rectDiff.getAsNumber(1).floatValue(),
rectDiff.getAsNumber(2).floatValue(),
rectDiff.getAsNumber(3).floatValue(),
rectDiff.getAsNumber(0).floatValue(), false);
}
float leftRectDiff = rectDiff != null ? rectDiff.getAsNumber(0).floatValue() : 0;
float topRectDiff = rectDiff != null ? rectDiff.getAsNumber(1).floatValue() : 0;
List<IElement> elements = HtmlConverter.convertToElements(document.body().outerHtml());
PdfFormXObject appearance = new PdfFormXObject(
new Rectangle(0, 0, bbox.getWidth(), bbox.getHeight()));
Canvas canvas = new Canvas(new PdfCanvas(appearance, pdfDocument),
new Rectangle(leftRectDiff, topRectDiff, textBbox.getWidth(), textBbox.getHeight()));
canvas.setProperty(Property.RENDERING_MODE, RenderingMode.HTML_MODE);
for (IElement ele : elements) {
if (ele instanceof IBlockElement) {
canvas.add((IBlockElement) ele);
}
}
curAnnot.getAsDictionary(PdfName.AP).put(PdfName.N, appearance.getPdfObject());
}
}
}
}
pdfDocument.close();
In the result, the new text is displayed as expected, but the overall visual representation is far from our expectations: the background filling, the borders and the arrows are missing. So to generate the appearance properly you would have to further explore other PDF properties such as /CL (arrow descriptors), /BS (border style), /C (background color) etc. This takes quite some time: reading up on the spec, parsing the relevant entries and applying those in your drawing operations. You can get some inspiration from the PdfFormField class implementation.
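The /RD adjustment used above can be worked through in isolation. The following is a minimal sketch with a hypothetical helper (RectDiff.innerRect), assuming /Rect is given as [llx, lly, urx, ury] and /RD as [left, top, right, bottom], which is the order the code comment above relies on.

```java
final class RectDiff {
    // rect = [llx, lly, urx, ury]; rd = [left, top, right, bottom], or null when /RD is absent.
    // Returns the inner rectangle (where the text goes) in the same [llx, lly, urx, ury] form.
    static float[] innerRect(float[] rect, float[] rd) {
        if (rd == null) {
            return rect.clone();
        }
        return new float[] {
            rect[0] + rd[0], // left edge moves right by the left difference
            rect[1] + rd[3], // bottom edge moves up by the bottom difference
            rect[2] - rd[2], // right edge moves left by the right difference
            rect[3] - rd[1]  // top edge moves down by the top difference
        };
    }
}
```

For a /Rect of [100 200 300 400] and an /RD of [10 20 30 40], the text area would be [110 240 270 380].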
2. Easy solution without any guarantees
In case you expect the text in your annotation to be a single line of plain Latin text, and the variability of the input documents is small in general, you can take the current appearance and assume that the text string is written there in one chunk (this is the case for your input document).
Note that this is a hacky approach which is prone to many potential errors and bugs.
Sample code:
PdfDocument pdfDocument = new PdfDocument(new PdfReader("in PDF.pdf"),
new PdfWriter("out PDF.pdf"));
int numberOfPages = pdfDocument.getNumberOfPages();
for (int i = 1; i <= numberOfPages; i++) {
PdfDictionary page = pdfDocument.getPage(i).getPdfObject();
PdfArray annotArray = page.getAsArray(PdfName.Annots);
if (annotArray == null) {
continue;
}
int size = annotArray.size();
for (int x = 0; x < size; x++) {
PdfDictionary curAnnot = annotArray.getAsDictionary(x);
if (curAnnot.getAsString(PdfName.Contents) != null) {
String contents = curAnnot.getAsString(PdfName.Contents).toString();
String oldContent = "old content";
if (!contents.isEmpty() && contents.contains(oldContent)) {
String newContent = "new content";
curAnnot.put(PdfName.Contents, new PdfString(newContent));
String richText = curAnnot.getAsString(PdfName.RC).toUnicodeString();
Document document = Jsoup.parse(richText);
for (Element element : document.select("p")) {
element.html(newContent);
}
curAnnot.put(PdfName.RC, new PdfString(document.body().outerHtml()));
PdfStream currentAppearance = curAnnot.getAsDictionary(PdfName.AP).getAsStream(PdfName.N);
String currentBytes = new String(currentAppearance.getBytes(), StandardCharsets.UTF_8);
currentBytes = currentBytes.replace("(" + oldContent + ") Tj", "(" + newContent + ") Tj");
currentAppearance.setData(currentBytes.getBytes(StandardCharsets.UTF_8));
}
}
}
}
pdfDocument.close();
Visual result: this is what we want.
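The string replacement at the heart of this approach can be tried on its own. The sketch below uses hypothetical names (TjReplace, replaceShownText) and makes two assumptions beyond the sample code: the bytes are the already decoded content stream, and the text stays within Latin-1. It decodes with ISO-8859-1, which maps every byte to one char and back, and it escapes backslashes and parentheses, which are special inside a PDF literal string.

```java
import java.nio.charset.StandardCharsets;

final class TjReplace {
    // Escapes the characters that are special inside a PDF literal string: \ ( )
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    // Replaces "(oldText) Tj" with "(newText) Tj" in a decoded content stream.
    static byte[] replaceShownText(byte[] stream, String oldText, String newText) {
        // ISO-8859-1 round-trips every byte value, so unrelated bytes are left untouched.
        String content = new String(stream, StandardCharsets.ISO_8859_1);
        String updated = content.replace(
                "(" + escape(oldText) + ") Tj",
                "(" + escape(newText) + ") Tj");
        return updated.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

If the old text is not present as a single Tj operand (for example because the producer used TJ arrays or split the string), the stream comes back unchanged, which is the main way this shortcut fails.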
3. Non-compliant solution
Another way, which is not compliant with the PDF specification, is to remove the /AP entry altogether. You can do it in the very same loop with curAnnot.remove(PdfName.AP);. Most major PDF viewers are going to regenerate the appearance themselves. However, my viewer did not generate the appearance in the most appealing way.
So the result will depend on the PDF viewer, which illustrates very well why the PDF specification mandates the presence of /AP. Once again, this way is not compliant with the PDF spec.

Adding a group of annotations from one pdf in all pages of another pdf and create a new output pdf

I have created a reader for the input file and one for the markup file. I am not sure if I should loop through the annotations and add them one by one to the output, or if there is a way to pull all the annotations from the markup file and add them to the input file while retaining their x,y coordinates.
I have the code below, and I am not sure what to do in the commented section. The AddAnnotation method only takes a PdfAnnotation as input, but I am not sure how to convert the PdfDictionary to a PdfAnnotation.
class Program
{
public static string inputFile = @"E:\pdf-sample.pdf";
public static string markupFile = @"E:\StampPdf.pdf";
public static string outputFile = @"E:\pdf.pdf";
public static PdfReader inputReader = new PdfReader(inputFile);
public static PdfReader markupReader = new PdfReader(markupFile);
static void Main(string[] args)
{
PdfDocument inputDoc = new PdfDocument(inputReader, new PdfWriter(outputFile));
PdfDocument markupDoc = new PdfDocument(markupReader);
int n = inputDoc.GetNumberOfPages();
for (int i = 1; i <= n; i++)
{
PdfPage page = inputDoc.GetPage(i);
PdfDictionary markupPage = markupDoc.GetFirstPage().GetPdfObject();
PdfArray annots = markupPage.GetAsArray(PdfName.Annots);
if(annots != null)
{
for(int j=0; j < annots.Size(); j++)
{
PdfDictionary annotItem = annots.GetAsDictionary(j); // index with the inner loop variable j, not the page counter i
//******
//page.AddAnnotation(?);
//******
}
}
}
inputDoc.Close();
}
}
I tried another variation after I found the new GetAnnotations method in iText7. Here the code runs fine, but I am not able to open the output file and get an error that the file is corrupted. Also, when I ran inputDoc.Close() instead of the last line given below, I got the error "Pdf indirect object belongs to other PDF document. Copy object to current pdf document."
PdfReader ireader = new PdfReader(inputFile);
PdfDocument inputDoc = new PdfDocument(ireader, new PdfWriter(outputFile));
PdfReader mreader = new PdfReader(markupFile);
PdfDocument markupDoc = new PdfDocument(mreader);
var annots = markupDoc.GetFirstPage().GetAnnotations();
if (annots != null)
{
for (int j = 0; j < annots.Count(); j++)
{
inputDoc.GetFirstPage().AddAnnotation(annots[j]);
}
}
ireader.Close();
mreader.Close();
markupDoc.Close();
inputDoc.SetCloseWriter(true);
Maybe try this:
if (annots != null)
{
    for (int j = 0; j < annots.Size(); j++)
    {
        // use the inner loop variable j to walk the annotation array
        PdfDictionary annotItem = annots.GetAsDictionary(j);
        PdfLineAnnotation lineAnnotation = new PdfLineAnnotation(annotItem);
        page.AddAnnotation(lineAnnotation);
    }
}
If it doesn't work, here is some documentation (unfortunately in Java):
http://developers.itextpdf.com/examples/actions-and-annotations/clone-creating-and-adding-annotations
If you could post the PDF with the annotations you wish to copy, maybe I can debug it and try something more.

c# itextsharp, locate words not chunks in page with their location for adding sticky notes

I have already read all the related StackOverflow questions and haven't found a decent solution to this. I want to open a PDF, get the text (words) and their coordinates, and then add a sticky note to some of them.
It seems to be mission impossible, and I'm stuck.
How come this code correctly finds all the words in a page (but not their coordinates)?
using (PdfReader reader = new PdfReader(path))
{
StringBuilder sb = new StringBuilder();
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
for (int page = 5; page <= 5; page++)
{
string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
Console.WriteLine(text);
}
//txt = sb.ToString();
}
This one, on the other hand, gets coordinates, but only for "chunks", and I cannot rely on them being in the proper order.
PdfReader reader = new PdfReader(path);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
LocationTextExtractionStrategyEx strategy;
for (int i = 5; i <= 5; i++) // reader.NumberOfPages
{
//strategy = parser.ProcessContent(i, new SimpleTextExtractionStrategy());
// new MyLocationTextExtractionStrategy("sample", System.Globalization.CompareOptions.None)
strategy = parser.ProcessContent(i, new LocationTextExtractionStrategyEx("MCU_MOSI", 0));
foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk chunk in strategy.m_DocChunks)
{
if (chunk.m_text.Trim() == "MCU_MOSI")
Console.WriteLine("Bingo"); // <-- NEVER HIT
}
//Console.WriteLine(strategy.m_SearchResultsList.ToString()); // strategy.GetResultantText() +
}
This uses a class from this post (slightly modified by me):
Getting Coordinates of string using ITextExtractionStrategy and LocationTextExtractionStrategy in Itextsharp
But it only finds useless "chunks".
So the question is: can iTextSharp really locate words in a page, so that I can add some sticky notes nearby? Thank you.
It looks like chunk.m_text only contains one letter at a time, which is why this will never be true:
if (chunk.m_text.Trim() == "MCU_MOSI")
What you could do instead is have each chunk text added to a string and see if it contains your text.
PdfReader reader = new PdfReader(path);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
LocationTextExtractionStrategyEx strategy;
string str = string.Empty;
for (int i = 5; i <= 5; i++) // reader.NumberOfPages
{
strategy = parser.ProcessContent(i, new LocationTextExtractionStrategyEx("MCU_MOSI", 0));
var x = strategy.m_SearchResultsList;
foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk chunk in strategy.m_DocChunks)
{
str += chunk.m_text;
if (str.Contains("MCU_MOSI"))
{
str = string.Empty;
Vector location = chunk.m_endLocation;
Console.WriteLine("Bingo");
}
}
}
Note: to read the location in this example, I made m_endLocation public.
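The accumulation idea can also be sketched independently of iTextSharp, and extended so that both the start and the end of the word are known, which is useful for placing a note. Everything here is hypothetical (WordLocator, Chunk, Hit), and the sketch assumes the chunks already arrive in reading order, which, as the question points out, is not guaranteed.

```java
import java.util.ArrayList;
import java.util.List;

final class WordLocator {
    // A piece of text with the horizontal span it covers and its baseline (hypothetical type).
    static final class Chunk {
        final String text; final float startX; final float endX; final float y;
        Chunk(String text, float startX, float endX, float y) {
            this.text = text; this.startX = startX; this.endX = endX; this.y = y;
        }
    }

    // Where a matched word starts and ends on the page.
    static final class Hit {
        final float startX; final float endX; final float y;
        Hit(float startX, float endX, float y) { this.startX = startX; this.endX = endX; this.y = y; }
    }

    // Joins all chunk texts, searches the joined string, and maps each match back to its chunks.
    static List<Hit> find(List<Chunk> chunks, String word) {
        List<Hit> hits = new ArrayList<>();
        if (word.isEmpty()) {
            return hits;
        }
        StringBuilder joined = new StringBuilder();
        List<Integer> owner = new ArrayList<>(); // owner.get(k) = index of the chunk that produced char k
        for (int c = 0; c < chunks.size(); c++) {
            String t = chunks.get(c).text;
            for (int k = 0; k < t.length(); k++) {
                joined.append(t.charAt(k));
                owner.add(c);
            }
        }
        int from = 0;
        int at;
        while ((at = joined.indexOf(word, from)) >= 0) {
            Chunk first = chunks.get(owner.get(at));
            Chunk last = chunks.get(owner.get(at + word.length() - 1));
            hits.add(new Hit(first.startX, last.endX, first.y));
            from = at + word.length();
        }
        return hits;
    }
}
```

Compared with clearing the accumulated string after each match, keeping an index from characters back to chunks makes it possible to report where the word begins, not only where it ends.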

Read columns of PDF in C# using ItextSharp

In my program I extract text from a PDF file, and it works well. iTextSharp extracts text from the PDF line by line. However, when a PDF file contains 2 columns, the extracted text is not correct, because each line joins the two columns.
My problem is: how can I extract the text column by column?
Below is my code. The PDF files are in Arabic. I'm sorry, my English is not so good.
PdfReader reader = new PdfReader(@"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;
string[] words;
string line;
for (int i = 1; i <= intPageNum; i++)
{
text = PdfTextExtractor.GetTextFromPage(reader, i,
new LocationTextExtractionStrategy());
words = text.Split('\n');
for (int j = 0, len = words.Length; j < len; j++)
{
line = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(words[j]));
// other things here
}
// other things here
}
You may want to use RegionTextRenderFilter to restrict extraction to a column region, then use LocationTextExtractionStrategy to extract the text. However, this requires prior knowledge of the PDF file you are parsing, i.e. you need information about the column's position and size.
In more detail, you need to pass in the coordinates of your column to define a rectangle, then extract the text from that rectangle. A sample would look like this:
PdfReader reader = new PdfReader(@"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;

// Extracts the text inside the given rectangle on one page.
private static string GetColumnText(PdfReader reader, int pageNumber,
    float llx, float lly, float urx, float ury)
{
    // reminder: parameters are in points, and 1 in = 2.54 cm = 72 points
    var rect = new iTextSharp.text.Rectangle(llx, lly, urx, ury);
    var renderFilter = new RenderFilter[1];
    renderFilter[0] = new RegionTextRenderFilter(rect);
    var textExtractionStrategy =
        new FilteredTextRenderListener(new LocationTextExtractionStrategy(),
            renderFilter);
    // pageNumber is 1-based; call this once per page and per column
    var text = PdfTextExtractor.GetTextFromPage(reader, pageNumber,
        textExtractionStrategy);
    return text;
}
Here is another post discussing the same problem that you may want to check as well: iTextSharp - Reading PDF with 2 columns. They didn't arrive at a solution either, though :(
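If the columns are evenly spaced, the rectangles to pass in can be computed rather than measured by hand. This is a rough sketch with hypothetical names (ColumnRegions.split); it assumes equal-width columns, a uniform page margin and a fixed gutter, none of which is guaranteed for a real journal layout.

```java
final class ColumnRegions {
    static final float POINTS_PER_INCH = 72f;

    // Returns one [llx, lly, urx, ury] rectangle per column, ordered left to right, in points.
    static float[][] split(float pageWidth, float pageHeight, float margin, int columns, float gutter) {
        float usable = pageWidth - 2 * margin - gutter * (columns - 1);
        float colWidth = usable / columns;
        float[][] rects = new float[columns][];
        for (int c = 0; c < columns; c++) {
            float llx = margin + c * (colWidth + gutter);
            rects[c] = new float[] { llx, margin, llx + colWidth, pageHeight - margin };
        }
        return rects;
    }
}
```

For a 612 x 792 point page with half-inch margins and a quarter-inch gutter, the two columns span x = 36..297 and x = 315..576. For right-to-left text such as Arabic, the columns would presumably be read in reverse order, starting with the rightmost rectangle.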

Using iTextSharp to add repeating data to an existing PDF?

I am going to be using iTextSharp to insert data to a PDF that the Graphics department has created. Most of this data is simple data-to-field mapping, but some data is a list of items that needs to be added (e.g. product data; users can have any number of products and the data needs to be displayed for all of them).
Is it possible to do this with iTextSharp? The PDF template cannot, obviously, be created with a certain number of fields as there is no way of knowing how many fields there will be - it could be 1, or 10, or even 100; what I need to be able to do is "re-use" a section of the PDF and repeat that section for each item within a loop.
Is that doable?
In the past I needed to do something similar. I needed to create a PDF with an unknown number of images plus content. In my case an 'Entry' was defined by an image and a set of fields.
What I did was keep a document that served as an 'Entry' template. I then generated a temporary PDF file for each 'Entry' and stored the generated file names in a List.
After all 'Entries' were processed, I merged all the temporary PDF documents into one final document.
Here is some code to give you a better idea (it's not compilable and just serves as a reference, as I took certain parts from an older project).
List<string> files = new List<string>(); // list of files to merge
foreach (string pageId in pages)
{
// create an intermediate page
string intermediatePdf = Path.Combine(_tempPath, System.Guid.NewGuid() + ".pdf");
files.Add(intermediatePdf);
string pdfTemplate = Path.Combine(_templatePath, _template);
CreatePage(pdfTemplate, intermediatePdf, pc, pageValues, imageMap, tmd);
}
// merge into resulting pdf file
string outputFolder = "~/Output/";
if (preview)
{
outputFolder = "~/temp/";
}
string pdfResult = Path.Combine(HttpContext.Current.Server.MapPath(outputFolder), Guid.NewGuid().ToString() + ".pdf");
PdfMerge.MergeFiles(pdfResult, files);
//////////////////////////////////////////////////////////////////////////
// delete temporary files...
foreach (string fd in files)
{
File.Delete(fd);
}
return pdfResult;
Here is the code to merge the templates:
public class PdfMerge
{
public static void MergeFiles(string destinationFile, List<string> sourceFiles)
{
int f = 0;
// we create a reader for a certain document
PdfReader reader = new PdfReader(sourceFiles[f]);
// we retrieve the total number of pages
int n = reader.NumberOfPages;
// step 1: creation of a document-object
Document document = new Document(reader.GetPageSizeWithRotation(1));
// step 2: we create a writer that listens to the document
PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(destinationFile, FileMode.Create));
// step 3: we open the document
document.Open();
PdfContentByte cb = writer.DirectContent;
PdfImportedPage page;
int rotation;
// step 4: we add content
while (f < sourceFiles.Count)
{
int i = 0;
while (i < n)
{
i++;
document.SetPageSize(reader.GetPageSizeWithRotation(i));
document.NewPage();
page = writer.GetImportedPage(reader, i);
rotation = reader.GetPageRotation(i);
if (rotation == 90 || rotation == 270)
{
cb.AddTemplate(page, 0, -1f, 1f, 0, 0, reader.GetPageSizeWithRotation(i).Height);
}
else
{
cb.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
}
}
f++;
if (f < sourceFiles.Count)
{
reader = new PdfReader(sourceFiles[f]);
// we retrieve the total number of pages
n = reader.NumberOfPages;
}
}
// step 5: we close the document
document.Close();
}
}
Hope it helps!
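The six numbers passed to AddTemplate in the rotated branch are a PDF transformation matrix [a b c d e f]. The sketch below (hypothetical helper TemplateMatrix.apply) applies such a matrix to a point, assuming the usual PDF convention x' = a*x + c*y + e and y' = b*x + d*y + f, to show why the height is needed as the vertical offset.

```java
final class TemplateMatrix {
    // Applies a PDF transformation matrix [a b c d e f] to the point (x, y).
    static float[] apply(float a, float b, float c, float d, float e, float f, float x, float y) {
        return new float[] { a * x + c * y + e, b * x + d * y + f };
    }
}
```

Take a 612 x 792 page with a rotation of 90, so the rotated page size is 792 x 612 and the height passed in is 612. With the matrix (0, -1, 1, 0, 0, 612), the source corner (612, 0) lands on (0, 0) and the corner (0, 792) lands on (792, 612), so the rotated content fills the new page; without the offset it would end up below the visible area.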
