Extract font height and rotation from PDF files with iText/iTextSharp

I created some code to extract text and font height from a PDF file using iTextSharp, but it does not handle text rotation. How can that information be extracted or computed?
Here is the code:
// Create PDF reader
var reader = new PdfReader("myfile.pdf");
for (var k = 1; k <= reader.NumberOfPages; ++k)
{
// Get page resources
var page = reader.GetPageN(k);
var pdfResources = page.GetAsDict(PdfName.RESOURCES);
// Create custom render listener, processor, and process page!
var listener = new FunnyRenderListener();
var processor = new PdfContentStreamProcessor(listener);
var bytes = ContentByteUtils.GetContentBytesForPage(reader, k);
processor.ProcessContent(bytes, pdfResources);
}
[...]
public class FunnyRenderListener : IRenderListener
{
[...]
// Interface members must be implemented as public
public void RenderText(TextRenderInfo renderInfo)
{
// Get text
var text = renderInfo.GetText();
// Get (computed) font size
var bottomLeftPoint = renderInfo.GetDescentLine().GetStartPoint();
var topRightPoint = renderInfo.GetAscentLine().GetEndPoint();
var rectangle = new Rectangle(
bottomLeftPoint[Vector.I1], bottomLeftPoint[Vector.I2],
topRightPoint[Vector.I1], topRightPoint[Vector.I2]
);
var fontSize = Convert.ToDouble(rectangle.Height);
Console.WriteLine("Text: {0}, FontSize: {1}", text, fontSize);
}
}

The information you need, i.e. the text rotation, is not directly available via a TextRenderInfo member but it does have the method
/**
* Gets the baseline for the text (i.e. the line that the text 'sits' on)
* This value includes the Rise of the draw operation - see getRise() for the amount added by Rise
*/
public LineSegment GetBaseline()
Most likely by text rotation you mean the rotation of this line against a horizontal one. Doing some easy math, therefore, you can calculate the rotation from this LineSegment.
PS: Looking at your code you actually already use the ascent line and descent line. You can use any of these lines as well instead of the base line.
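
The "easy math" can be sketched as follows. This is only a minimal illustration, not iText API: the class TextGeometry and its methods are hypothetical helpers that take plain coordinates, so they work with whatever start and end points you read from the LineSegment.

```csharp
using System;

// Hypothetical helper, not part of iText/iTextSharp.
public static class TextGeometry
{
    // Counterclockwise angle in degrees between the segment
    // (startX, startY) -> (endX, endY) and the horizontal axis.
    public static double RotationDegrees(double startX, double startY, double endX, double endY)
    {
        return Math.Atan2(endY - startY, endX - startX) * 180.0 / Math.PI;
    }

    // Euclidean distance between two points.
    public static double Distance(double x1, double y1, double x2, double y2)
    {
        double dx = x2 - x1;
        double dy = y2 - y1;
        return Math.Sqrt(dx * dx + dy * dy);
    }
}
```

Inside RenderText you would pass in the coordinates of renderInfo.GetBaseline().GetStartPoint() and GetEndPoint(), read with Vector.I1 and Vector.I2 as in the question. For rotated text the height of an axis-aligned rectangle no longer matches the font height; assuming no skew is involved, the Distance between the start point of the descent line and the start point of the ascent line is probably a better measure.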

Related

How to get uniform line space for a mixed paragraph of texts and images

I am using iText 7.2.1.
I am trying to add some small icons (drawn by code) to my text. Once small icons are added to the text, it is hard to keep the line spacing uniform.
If all elements of a paragraph are text, I can just call SetFixedLeading(); then, no matter how big the font sizes are, my lines always have the same height.
But when I add some small icons inside my paragraph, SetFixedLeading() no longer works.
What I want is something like the "Line spacing" option in Microsoft Word. If I give it a fixed value, it treats embedded images and text equally, so I always get fixed line spacing.
The following is my code:
using iText.Kernel.Colors;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas;
using iText.Layout;
using iText.Kernel.Pdf.Xobject;
using iText.Layout.Element;
using iText.Kernel.Geom;
using iText.Kernel.Font;
using iText.IO.Font;
namespace iTextTest
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
var writer = new PdfWriter("test.pdf");
var pdf_doc = new PdfDocument(writer);
var doc = new Document(pdf_doc, iText.Kernel.Geom.PageSize.DEFAULT, false);
// Make a text of various sizes
var mixed_paragraph = new Paragraph();
for (int i = 0; i < 100; i ++)
{
var style = new Style();
var size = (Math.Sin(i) + 2) * 10;
style.SetFontSize((float)size);
mixed_paragraph.Add(new Text("A").AddStyle(style));
}
// Make a 20x20 icon
var bounds = new iText.Kernel.Geom.Rectangle(0, 0, 20, 20);
var xobj = new PdfFormXObject(bounds);
var pdf_canvas = new PdfCanvas(xobj, pdf_doc);
pdf_canvas.SetFillColor(ColorConstants.RED);
pdf_canvas.Rectangle(0, 0, 20, 20);
pdf_canvas.Fill();
var icon = new iText.Layout.Element.Image(xobj);
mixed_paragraph.Add(icon);
// Fixed leading
mixed_paragraph.SetFixedLeading(10);
doc.Add(mixed_paragraph);
doc.Close();
pdf_doc.Close();
writer.Close();
MessageBox.Show("OK");
}
}
}
This is what it looks like. The second line is right, but the third line has more space than the fixed leading of 10.
I need this because, in my case, I need some small rectangular icons that each contain two lines of integers and other info.
These icons are taller than my text (otherwise they are hard to read), but theoretically they can still fit because my text has enough spacing.
Unfortunately, my line spacing becomes uneven. Fixed leading does not seem to affect non-text elements such as images, so lines with icons get wider line spacing.
I have been considering a workaround: add empty spaces to the text and put the icons at these fixed positions. That is still hard, because I don't know how to get these positions.

Resize Page Height and Width with image inside c#

I use Aspose.Words. When I resize the page, everything changes, but the images extend beyond the boundaries of the text area.
There are several images in the document and I have no idea how to fix it.
var input = @"d:\1.docx";
var output = @"d:\2.docx";
Document doc = new Document(input);
DocumentBuilder builder = new DocumentBuilder(doc);
if (project.Variables["flagsize"].Value == "69")
{
// Set a custom page size (152.4 mm x 228.6 mm)
builder.PageSetup.PageWidth = ConvertUtil.MillimeterToPoint(152.4);
builder.PageSetup.PageHeight = ConvertUtil.MillimeterToPoint(228.6);
// Set the font size of every run in the document
Node[] runs = doc.GetChildNodes(NodeType.Run, true).ToArray();
for (int j = 0; j < runs.Length; j++)
{
Run run = (Run)runs[j];
run.Font.Size = 18;
}
}
foreach (Section section in doc)
{
section.PageSetup.PaperSize = Aspose.Words.PaperSize.Custom;
section.PageSetup.LeftMargin = ConvertUtil.MillimeterToPoint(22);
section.PageSetup.RightMargin = ConvertUtil.MillimeterToPoint(22);
}
doc.Save(output);
I tried to find the right Aspose.Words method for this.
I expect all images in the document to end up with the right dimensions.
I think this is the code I need:
foreach (Aspose.Words.Drawing.Shape shape in doc)
{
shape.Width ...
}
But I get this error:
Unable to cast object of type "Aspose.Words.Section" to type "Aspose.Words.Drawing.Shape".

To get all shapes in the document, you can use Document.GetChildNodes method passing the appropriate NodeType as a parameter. For example the following code returns all shapes in the document:
NodeCollection shapes = doc.GetChildNodes(NodeType.Shape, true);
You can use LINQ to filter the collection; for example, the following code returns the shapes that have an image:
List<Shape> shapes = doc.GetChildNodes(NodeType.Shape, true)
.Cast<Shape>().Where(s => s.HasImage).ToList();
It looks like your requirement is to fit the image to image size. I think the example provided here might be useful for you. In the provided example an image is inserted into the document and the page is adjusted to the actual image size. Then the resulting document is converted to PDF.

NodeCollection shapes = doc.GetChildNodes(NodeType.Shape, true);
PageSetup page_Setup = doc.FirstSection.PageSetup;
foreach (Shape shape in shapes)
{
shape.HorizontalAlignment = HorizontalAlignment.Center;
shape.Width = page_Setup.PageWidth - page_Setup.LeftMargin - page_Setup.RightMargin;
}

iText 7 ImageRenderInfo Matrix contains negative height on Even number Pages

I have a PDF with four pages. Two images on the first page, one on the second, and one on the third. When I retrieve the value of the image on the second page or fourth, I get a negative height. I tried taking the absolute value as a quick fix, but the Y position of the image was still slightly off. Also, the height and positioning on page three were fine.
Update: So far, this only seems to be a problem with PDFs created in Google Docs.
My code to extract the PDF images was taken from this thread Using iText 7, what's the proper way to export a Flate encoded image?.
This is how I access the height
var currentPDFImageInfo = extractedImages[i];
var currentPDFImageMatrix = currentPDFImageInfo.RenderInfo.GetImageCtm();
float pdfImageWidth = currentPDFImageMatrix.Get(iText.Kernel.Geom.Matrix.I11);
How I retrieve the PDF image data
public static List<PDFImageInfo> ExtractImagesFromPDF(string filePath)
{
Reader = new PdfReader(filePath);
Document = new PdfDocument(Reader);
var strategy = new ImageRenderListener();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
for (int pageNumber = 1; pageNumber <= Document.GetNumberOfPages(); pageNumber++)
{
strategy.CurrentPageNumber = pageNumber;
parser.ProcessPageContent(Document.GetPage(pageNumber));
}
return strategy.ImageInfoList;
}
And of course the Strategy class
public class ImageRenderListener : IEventListener
{
public void EventOccurred(IEventData data, EventType type)
{
if (data is ImageRenderInfo imageData)
{
try
{
if (imageData.GetImage() == null)
{
Console.WriteLine("Image could not be read.");
}
else
{
var pdfImageInfo = new PDFImageInfo(CurrentPageNumber, imageData);
ImageInfoList.Add(pdfImageInfo);
}
}
catch (Exception ex)
{
Console.WriteLine("Image could not be read: {0}.", ex.Message);
}
}
}
public ICollection<EventType> GetSupportedEvents()
{
return null;
}
public int CurrentPageNumber { get; set; }
public List<PDFImageInfo> ImageInfoList { get; set; } = new List<PDFImageInfo>();
}

This is how I access the height
var currentPDFImageInfo = extractedImages[i];
var currentPDFImageMatrix = currentPDFImageInfo.RenderInfo.GetImageCtm();
float pdfImageWidth = currentPDFImageMatrix.Get(iText.Kernel.Geom.Matrix.I11);
This value is the height only under certain circumstances.
Some background: The contents of a PDF page are drawn by a sequence of instructions in some content stream. Some of these instructions can manipulate the so-called current transformation matrix (CTM), which represents an affine transformation, i.e. some combination of rotation, translation, scaling, mirroring, and skewing. Everything that other instructions draw is transformed by the CTM value at the time the instruction is executed.
When a bitmap image is drawn, it is conceptually first reduced to a 1×1 square which then is transformed by the CTM to the final form of the image on the page.
If the image is displayed upright, with no rotation or anything else involved, then indeed the I11 value is the width of the displayed image and the I22 value is the height. The I12 and I21 values are 0 then.
But often bitmaps are displayed rotated by 90° clockwise or counterclockwise (e.g. because someone held the camera at a 90° angle while shooting). In these cases I11 and I22 are 0 while I12 and I21 are the height and width respectively, with one or the other having a negative sign depending on the direction of the rotation.
If the bitmap is rotated by 180°, I11 and I22 again contain width and height, but both with a negative sign. If it's mirrored along the x axis or the y axis, one of them is negative.
And if the transformation is something else, e.g. a rotation by an angle that's not a multiple of 90°, finding the height and width becomes more complicated.
Actually then it is not even clear what height and width of the skewed, rotated, and mirrored form shall mean.
Thus, as a start please define which values you exactly are after; based on that you can try and determine them from arbitrary transformation matrices.
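
As an illustration, here is a minimal sketch of one possible definition. It is only an assumption about what you might want: the "size" is taken as the lengths of the two edges of the transformed unit square (measured along the image's own axes, so for a 90° rotation the width edge runs vertically on the page), and the "position" as the axis-aligned bounding box of the transformed square. The class ImageCtm is a hypothetical helper, not iText API; its parameters are the matrix entries read via Get(Matrix.I11) and so on.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of iText.
public static class ImageCtm
{
    // Length of the edge the image's horizontal side (1, 0) is mapped to.
    public static double WidthEdgeLength(double i11, double i12)
    {
        return Math.Sqrt(i11 * i11 + i12 * i12);
    }

    // Length of the edge the image's vertical side (0, 1) is mapped to.
    public static double HeightEdgeLength(double i21, double i22)
    {
        return Math.Sqrt(i21 * i21 + i22 * i22);
    }

    // Axis-aligned bounding box of the transformed unit square.
    // (i31, i32) is the translation part of the matrix.
    public static (double MinX, double MinY, double MaxX, double MaxY) BoundingBox(
        double i11, double i12, double i21, double i22, double i31, double i32)
    {
        double[] xs = { i31, i31 + i11, i31 + i21, i31 + i11 + i21 };
        double[] ys = { i32, i32 + i12, i32 + i22, i32 + i12 + i22 };
        return (xs.Min(), ys.Min(), xs.Max(), ys.Max());
    }
}
```

With a vertically mirrored image (negative I22), the translation (I31, I32) is the top left corner of the displayed image rather than the bottom left one, which may explain a Y position that is still off after only taking the absolute value of the height. The bounding box accounts for that.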
Another possible cause for unexplainable coordinate data for pages after the first one is that your code re-uses the PdfCanvasProcessor for each page without resetting:
var strategy = new ImageRenderListener();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
for (int pageNumber = 1; pageNumber <= Document.GetNumberOfPages(); pageNumber++)
{
strategy.CurrentPageNumber = pageNumber;
parser.ProcessPageContent(Document.GetPage(pageNumber));
}
This causes the graphics state at the end of one page incorrectly to be used as starting graphics state of the next one. Instead you should either use a new PdfCanvasProcessor instance for each page or call parser.Reset() at the start of each page.

What value to use for .MoveUp of canvas

The code below copies all pages from a PDF file to a new file and inserts on the first page a rectangle at the top with a red border holding a short text.
If I don't move it, a gap will be left at the top (here enlarged a lot, font size is 8 only):
However, if I move the rectangle up by an empiric value of 4:
iText.Kernel.Geom.Rectangle pageSize = firstPage.GetCropBox().MoveUp(4);
there will be a perfect match at the top:
The value 4 is not related to the font size.
I dislike magic numbers in code so, my question is: Why 4? What expression would reveal this value of 4?
The code line is in the first method here. The second is where it is used, and third is called from the first; it just supplies a style:
private static void RegisterDocument(PdfDocument pdfDocument, string registration)
{
// Magic value to close gap between top of page and top of rectangle with the registration.
const float moveUp = 4F;
Document document = new Document(pdfDocument, new PageSize(PageSize.A4));
PdfPage firstPage = document.GetPdfDocument().GetFirstPage();
Paragraph paragraph = new Paragraph(registration).AddStyle(RegistrationStyle());
iText.Kernel.Geom.Rectangle pageSize = firstPage.GetCropBox().MoveUp(moveUp);
LayoutContext layoutContext = new LayoutContext(new LayoutArea(1, pageSize));
IRenderer renderer = paragraph.CreateRendererSubTree();
renderer.SetParent(document.GetRenderer()).Layout(layoutContext);
Canvas canvas = new Canvas(new PdfCanvas(firstPage, true), pageSize);
canvas.Add(paragraph);
document.Close();
}
public static void RegisterPdf(string sourceFilename, string targetFilename, string registration)
{
if (registration.Length > 0)
{
// Open source and target PDF files.
PdfDocument sourcePdf = new PdfDocument(new PdfReader(sourceFilename));
PdfDocument targetPdf = new PdfDocument(new PdfWriter(targetFilename));
// Copy all pages from source PDF to target PDF.
sourcePdf.CopyPagesTo(1, sourcePdf.GetNumberOfPages(), targetPdf);
// Add registration to page 1 of target and save the document.
RegisterDocument(targetPdf, registration);
// Close the files.
sourcePdf.Close();
targetPdf.Close();
}
}
private static Style RegistrationStyle()
{
// Fixed design values for font and rectangle.
PdfFont font = PdfFontFactory.CreateFont(StandardFonts.HELVETICA);
const float fontSize = 8F;
const float rightPadding = 3F;
TextAlignment textAlignment = TextAlignment.RIGHT;
iText.Kernel.Colors.Color borderColor = ColorConstants.RED;
iText.Kernel.Colors.Color fillColor = ColorConstants.WHITE;
const float borderWidth = 0.7F;
Style style = new Style()
.SetFont(font)
.SetFontSize(fontSize)
.SetPaddingRight(rightPadding)
.SetTextAlignment(textAlignment)
.SetBackgroundColor(fillColor)
.SetBorder(new SolidBorder(borderColor, borderWidth));
return style;
}

You wonder
I dislike magic numbers in code so, my question is: Why 4? What expression would reveal this value of 4?
iText, when calculating the layout of some entity, retrieves properties from multiple sources, in particular the entity itself and its renderer. And it does not only ask them for explicitly set properties but also for defaults.
In the case at hand you see the default top margin value of the Paragraph class at work:
public override T1 GetDefaultProperty<T1>(int property) {
switch (property) {
case Property.LEADING: {
return (T1)(Object)new Leading(Leading.MULTIPLIED, childElements.Count == 1 && childElements[0] is Image ?
1 : 1.35f);
}
case Property.FIRST_LINE_INDENT: {
return (T1)(Object)0f;
}
case Property.MARGIN_TOP:
case Property.MARGIN_BOTTOM: {
return (T1)(Object)UnitValue.CreatePointValue(4f);
}
case Property.TAB_DEFAULT: {
return (T1)(Object)50f;
}
default: {
return base.GetDefaultProperty<T1>(property);
}
}
}
(iText Layout Paragraph method)
If you set the top margin of your paragraph to 0, you can simplify your code considerably:
public static void RegisterPdfImproved(string sourceFilename, string targetFilename, string registration)
{
using (PdfDocument pdf = new PdfDocument(new PdfReader(sourceFilename), new PdfWriter(targetFilename)))
using (Document document = new Document(pdf))
{
document.SetMargins(0, 0, 0, 0);
Paragraph paragraph = new Paragraph(registration)
.AddStyle(RegistrationStyle())
.SetMarginTop(0);
document.Add(paragraph);
}
}
Without any magic values you now get

There is no way to tell, without seeing all the details of your code. It could depend on an arbitrary number of circumstances and combinations of them. Examples:
default values in the PDF library you are using
margins defined in the document

Removing Text based watermarks using itextsharp

According to this post (Removing Watermark from PDF iTextSharp), @mkl's code works fine for ExtGState graphical watermarks, but I have tested this code on some files which have text-based watermarks behind the PDF contents (like this file: http://s000.tinyupload.com/index.php?file_id=05961025831018336372)
I have tried multiple solutions found on this site but had no success.
Can anyone help to remove this type of watermark by changing @mkl's solution above?
Thanks

Just like in the case of the question the OP references (Removing Watermark from PDF iTextSharp), you can remove the watermark from your sample file by building upon the PdfContentStreamEditor class presented in my answer to that question.
In contrast to the solution in that other answer, though, we do not want to hide vector graphics based on some transparency value but instead the writing "Archive of SID" from this:
First we have to select a criterion to recognize the background text by. Let's use the fact that the writing is by far the largest here. Using this criterion makes the task at hand essentially the iTextSharp/C# counterpart to this iText/Java solution.
There is a problem, though: As mentioned in that answer:
The gs().getFontSize() used in the second sample may not be what you expect it to be as sometimes the coordinate system has been stretched by the current transformation matrix and the text matrix. The code can be extended to consider these effects.
Exactly this is happening here: A font size of 1 is used and that small text then is stretched by means of the text matrix:
/NxF0 1 Tf
49.516754 49.477234 -49.477234 49.516754 176.690933 217.316086 Tm
Thus, we need to take the text matrix into account. Unfortunately the text matrix is a private member, so we will also need some reflection magic.
A possible background remover for that file looks like this:
class BigTextRemover : PdfContentStreamEditor
{
protected override void Write(PdfContentStreamProcessor processor, PdfLiteral operatorLit, List<PdfObject> operands)
{
if (TEXT_SHOWING_OPERATORS.Contains(operatorLit.ToString()))
{
Vector fontSizeVector = new Vector(0, Gs().FontSize, 0);
Matrix textMatrix = (Matrix) textMatrixField.GetValue(this);
Matrix currentTransformationMatrix = Gs().GetCtm();
Vector transformedVector = fontSizeVector.Cross(textMatrix).Cross(currentTransformationMatrix);
float transformedFontSize = transformedVector.Length;
if (transformedFontSize > 40)
return;
}
base.Write(processor, operatorLit, operands);
}
System.Reflection.FieldInfo textMatrixField = typeof(PdfContentStreamProcessor).GetField("textMatrix", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
List<string> TEXT_SHOWING_OPERATORS = new List<string>{"Tj", "'", "\"", "TJ"};
}
The 40 has been chosen with that text matrix in mind.
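
To see where the effective size comes from, here is a small sketch of the arithmetic for the text matrix quoted above. It assumes the current transformation matrix is the identity (it may not be in other files), and EffectiveFontSize is a hypothetical helper, not iTextSharp API. The row vector (0, fontSize) multiplied by the 2×2 part of the text matrix a b c d gives (fontSize·c, fontSize·d).

```csharp
using System;

// Hypothetical helper, not part of iTextSharp.
public static class EffectiveFontSize
{
    // Length of the vector (0, fontSize) after applying the text matrix.
    // Only the third and fourth Tm operands (c and d) contribute.
    // Assumption: the CTM is the identity and is therefore ignored here.
    public static double FromTextMatrix(double fontSize, double c, double d)
    {
        double x = fontSize * c;
        double y = fontSize * d;
        return Math.Sqrt(x * x + y * y);
    }
}
```

For the Tm line above, EffectiveFontSize.FromTextMatrix(1, -49.477234, 49.516754) comes out at roughly 70, comfortably above the threshold of 40.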
Applying it like this
[Test]
public void testRemoveBigText()
{
string source = @"sid-1.pdf";
string dest = @"sid-1-noBigText.pdf";
using (PdfReader pdfReader = new PdfReader(source))
using (PdfStamper pdfStamper = new PdfStamper(pdfReader, new FileStream(dest, FileMode.Create, FileAccess.Write)))
{
PdfContentStreamEditor editor = new BigTextRemover();
for (int i = 1; i <= pdfReader.NumberOfPages; i++)
{
editor.EditPage(pdfStamper, i);
}
}
}
to your sample file results in:
