Finding text coordinates using bytescout PDFExtractor C#


I have a PDF in which I need to find and replace some text. I know how to create overlays and add text, but I can't determine how to locate the coordinates of the existing text. This is the example I found on the Bytescout site:
// Create Bytescout.PDFExtractor.TextExtractor instance
TextExtractor extractor = new TextExtractor();
extractor.RegistrationName = "";
extractor.RegistrationKey = "";
// Find text
// Load sample PDF document
extractor.LoadDocumentFromFile(@"myPdf.pdf");
int pageCount = extractor.GetPageCount();
RectangleF location;
for (int i = 0; i < pageCount; i++)
{
    // Search each page for string
    if (extractor.Find(i, "OPTION 2", false, out location))
    {
        do
        {
            Console.WriteLine("Found on page " + i + " at location " + location.ToString());
        }
        while (extractor.FindNext(out location));
    }
}
Console.WriteLine();
Console.WriteLine("Press any key to continue...");
Console.ReadLine();
But it's not working because there is no overload of the Find method that takes 4 arguments. I'm not married to using Bytescout to find text coordinates in a PDF, but my company has a license. Is there a license-free way to find text coordinates in a PDF if Bytescout can't accomplish what I'm trying to do?

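Try using the three-argument overload, which returns a bool, and then read the bounds from the extractor's FoundText property:
if (extractor.Find(i, "OPTION 2", false))
{
    RectangleF location = extractor.FoundText.Bounds;
}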
(source: https://cdn.bytescout.com/help/BytescoutPDFExtractorSDK/html/M_Bytescout_PDFExtractor_TextExtractor_Find.htm)
The FoundText property implements the ISearchResult interface:
https://cdn.bytescout.com/help/BytescoutPDFExtractorSDK/html/T_Bytescout_PDFExtractor_ISearchResult.htm
which has these properties:
Public property Bounds: Bounding rectangle of all search result elements. Use Elements or GetElement(Int32) to get bounds of individual elements.
Public property ElementCount: Returns count of individual search result elements.
Public property Elements: Search result elements (individual text objects included into the search result) For COM/ActiveX use GetElement(Int32) instead.
Public property Height: Height of the bounding rectangle of search result. Use Elements or GetElement(Int32) to get bounds of individual elements.
Public property Left: Left coordinate of the bounding rectangle of search result. Use Elements or GetElement(Int32) to get bounds of individual elements.
Public property PageIndex: Index of the page containing the search result.
Public property Text: Text representation of the search result. Use Elements or GetElement(Int32) to get individual elements.
Public property Top: Top coordinate of the bounding rectangle of search result. Use Elements or GetElement(Int32) to get bounds of individual elements.
Public property Width: Width of the bounding rectangle of search result. Use Elements or GetElement(Int32) to get bounds of individual elements.

Related

C# iText7 text coordinate extraction question

I am working on a PDF text extractor with iText7 and am noticing strange text coordinates on a certain PDF. Most documents appear to yield x and y coordinates within the height and width of the page, but one seems to yield negatives. I was wondering if there is a standard approach to dealing with negative coordinates here. The basic approach is to use positive inch measurements from a PDF and to map them to iText7 extracted text and coordinates with a 1/72 scale value for inches per dot.
I am deriving from LocationTextExtractionStrategy, and the code is as follows:
private class LocationTextListStrategy : LocationTextExtractionStrategy
{
    private readonly List<TextRect> _textRects = new List<TextRect>();
    public List<TextRect> TextRects() => _textRects;

    public override void EventOccurred(IEventData data, EventType type)
    {
        if (!type.Equals(EventType.RENDER_TEXT))
            return;
        var renderInfo = (TextRenderInfo)data;
        var text = renderInfo.GetCharacterRenderInfos();
        foreach (var t in text)
        {
            if (string.IsNullOrWhiteSpace(t.GetText()))
                continue;
            AddTextRect(t);
        }
    }

    private void AddTextRect(TextRenderInfo t)
    {
        var letterStart = t.GetBaseline().GetStartPoint();
        var letterEnd = t.GetAscentLine().GetEndPoint();
        var newTextRect = new TextRect(
            text: t.GetText(),
            l: letterStart.Get(0),
            r: letterEnd.Get(0),
            t: letterEnd.Get(1),
            b: letterStart.Get(1));
        _textRects.Add(newTextRect);
    }
}

Each PDF page can have its own, custom coordinate system. It is common to have the origin in the lower left corner of the page but it is not required.
Key | Type | Value
MediaBox | rectangle | (Required; inheritable) A rectangle (see 7.9.5, "Rectangles"), expressed in default user space units, that shall define the boundaries of the physical medium on which the page shall be displayed or printed (see 14.11.2, "Page boundaries").
CropBox | rectangle | (Optional; inheritable) A rectangle, expressed in default user space units, that shall define the visible region of default user space. When the page is displayed or printed, its contents shall be clipped (cropped) to this rectangle (see 14.11.2, "Page boundaries"). Default value: the value of MediaBox.
(ISO 32000-2:2017, Table 31 — Entries in a page)
Thus, always interpret coordinates with respect to the crop box of the page they refer to.
The iText 7 class PdfPage has matching getters.
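As a rough illustration of interpreting coordinates relative to the crop box, here is a minimal sketch. PageBox and PageCoordinates are hypothetical helpers written for this example and are not part of iText; in practice the box values would come from the page's crop box getter. The sketch assumes an unrotated page and the default of 72 user units per inch.

```csharp
// Hypothetical helper type for this sketch; it is not part of iText.
public readonly struct PageBox
{
    public PageBox(float left, float bottom, float right, float top)
    {
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }

    public float Left { get; }
    public float Bottom { get; }
    public float Right { get; }
    public float Top { get; }
}

public static class PageCoordinates
{
    // Assumption: the default user unit of 1/72 inch is in effect.
    private const float UnitsPerInch = 72f;

    // Shift a point so that the lower-left corner of the crop box becomes (0, 0).
    public static (float X, float Y) ToCropBoxRelative(PageBox cropBox, float x, float y)
    {
        return (x - cropBox.Left, y - cropBox.Bottom);
    }

    // Express a point in inches measured from the top-left corner of the crop box.
    public static (float X, float Y) ToInchesFromTopLeft(PageBox cropBox, float x, float y)
    {
        return ((x - cropBox.Left) / UnitsPerInch, (cropBox.Top - y) / UnitsPerInch);
    }
}
```

With a crop box whose lower-left corner is at (-100, -200), a glyph reported at (-28, -128) ends up at (72, 72) relative to the crop box, which is no longer negative.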

How to get coordinates of each word of PDF?

For each word I am creating an object of the LocationTextExtractionStrategy class to get its coordinates, but the problem is that each time I pass a word it returns the coordinates of all the chunks of that word present in the PDF. How can I get the coordinates of the word present at a specific position or in a specific line?
I found this code somewhere:
namespace PDFAnnotater
{
    public class RectAndText
    {
        public iTextSharp.text.Rectangle Rect;
        public string Text;

        public RectAndText(iTextSharp.text.Rectangle rect, string text)
        {
            this.Rect = rect;
            this.Text = text;
        }
    }

    public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
    {
        public List<RectAndText> myPoints = new List<RectAndText>();
        public string TextToSearchFor { get; set; }
        public System.Globalization.CompareOptions CompareOptions { get; set; }

        public MyLocationTextExtractionStrategy(string textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None)
        {
            this.TextToSearchFor = textToSearchFor;
            this.CompareOptions = compareOptions;
        }

        public override void RenderText(TextRenderInfo renderInfo)
        {
            base.RenderText(renderInfo);
            var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);
            //If not found bail
            if (startPosition < 0)
            {
                return;
            }
            var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();
            //Grab the first and last character
            var firstChar = chars.First();
            var lastChar = chars.Last();
            //Get the bounding box for the chunk of text
            var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
            var topRight = lastChar.GetAscentLine().GetEndPoint();
            //Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                bottomLeft[Vector.I1],
                bottomLeft[Vector.I2],
                topRight[Vector.I1],
                topRight[Vector.I2]
            );
            this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
        }
    }
}
I am passing words from an array to check for their coordinates. The problem is that the RenderText() method is automatically called again and again for each chunk and returns the list of coordinates of the word present at different places in the PDF. For example, if I need the coordinates of '0', it returns 23 coordinates. What should I do or modify in the code to get the exact coordinates of the word?

Your question is a bit confusing.
How can I get coordinates of the word present at specific position
In that statement you're basically saying "How can I get the coordinates of something that I already know the coordinates of?" Which is redundant.
I'm going to interpret your question as "How can I get the coordinates of a word, if I know the approximate location?"
I'm not familiar with C#, but I assume there are methods similar to the ones in Java for working with Rectangle objects.
Rectangle#intersects(Rectangle other)
Determines whether or not this Rectangle and the specified Rectangle intersect.
and
Rectangle#contains(Rectangle other)
Tests if the interior of the Shape entirely contains the specified Rectangle2D.
Then the code becomes trivially easy.
1. Use LocationTextExtractionStrategy to fetch all the iText-based rectangles.
2. Convert them to native rectangle objects (or write your own class).
3. For every rectangle, test whether the given search region contains it, keeping only those that are within the search region.
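The three steps above can be sketched roughly as follows. Region, WordBox and RegionSearch are hypothetical types written for this example (the "write your own class" route), not iTextSharp classes; the rectangles collected by the extraction strategy would be copied into WordBox instances first.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical axis-aligned rectangle with a lower-left origin, as in PDF user space.
public readonly struct Region
{
    public Region(float left, float bottom, float right, float top)
    {
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }

    public float Left { get; }
    public float Bottom { get; }
    public float Right { get; }
    public float Top { get; }

    // True when 'other' lies entirely inside this region.
    public bool Contains(Region other) =>
        other.Left >= Left && other.Right <= Right &&
        other.Bottom >= Bottom && other.Top <= Top;

    // True when the two regions overlap at all.
    public bool Intersects(Region other) =>
        other.Left < Right && other.Right > Left &&
        other.Bottom < Top && other.Top > Bottom;
}

// Hypothetical pairing of a found word with its bounding box.
public sealed class WordBox
{
    public WordBox(string text, Region bounds)
    {
        Text = text;
        Bounds = bounds;
    }

    public string Text { get; }
    public Region Bounds { get; }
}

public static class RegionSearch
{
    // Keep only the candidates whose bounds fall entirely within the search region.
    public static List<WordBox> FindInRegion(IEnumerable<WordBox> candidates, Region searchRegion)
    {
        return candidates.Where(c => searchRegion.Contains(c.Bounds)).ToList();
    }
}
```

If the approximate location is only loosely known, swapping Contains for Intersects makes the filter more forgiving at the cost of more false positives.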
If you want to implement your second use case (getting the location of a word if you know the line), then there are two options:
1. You know the rough coordinates of the line.
2. You want this to work given a line number.
For option 1:
Build a search region. Use the bounds of the page to get an idea of the width (since the line could stretch over the entire width), and add some margin to the y-coordinates (to account for font differences, subscript and superscript, etc.).
Now that you have a search region, this reverts to my earlier answer.
For option 2:
1. You already have the y-coordinate of every word.
2. Round those (to the nearest multiple of the font size).
3. Build a Map in which you keep track of how many times a certain y-coordinate is used.
4. Remove any statistical outliers.
5. Put all these values in a List.
6. Sort the list.
This should give you a rough idea of where you can expect a given line (number) to be.
Of course, similar to my earlier explanation, you will need to take into account some padding and some degree of flexibility to get the right answer.
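A minimal sketch of option 2 might look like this. LineIndex and its parameters are hypothetical names for this example. Outlier removal is simplified to a minimum word count per bucket, which is an assumption rather than a proper statistical test, and the result is sorted from the top of the page downwards on the assumption that the y-axis points up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineIndex
{
    // Estimate the y-coordinate of each text line from the y-coordinates of the words.
    // Index 0 of the result is the topmost line (largest y).
    public static List<float> EstimateLineBaselines(
        IEnumerable<float> wordYs, float fontSize, int minWordsPerLine)
    {
        // Count how many words fall on each rounded y-coordinate.
        var counts = new Dictionary<float, int>();
        foreach (var y in wordYs)
        {
            // Round to the nearest multiple of the font size.
            float bucket = (float)Math.Round(y / fontSize) * fontSize;
            counts[bucket] = counts.TryGetValue(bucket, out var n) ? n + 1 : 1;
        }

        // Simplified outlier removal: drop buckets with too few words,
        // then sort the remaining y-coordinates from top to bottom.
        return counts
            .Where(kv => kv.Value >= minWordsPerLine)
            .Select(kv => kv.Key)
            .OrderByDescending(y => y)
            .ToList();
    }
}
```

The estimated y-coordinate of line n (counting from zero) is then the n-th entry of the returned list, and a search region can be built around it with some padding as described for option 1.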

Why do certain TextBoxes not display on my PDF page (iTextSharp)?

I've got several textboxes on a PDF form. Most display fine, with data in them where appropriate, such as the "Required Date" textbox. Others display fine when passed no data (such as "Payment Amount"). Those at the bottom, though, do not display at all. e.g., look at the bottom section of the form, from "Requester / Payee Signature" down through the bottom of that ("Authorization") section - only horizontal lines appear below those labels, rather than the TextBoxes that should be there:
The code they use is virtually the same:
// "Request Date" is an example of those that display fine:
PdfPCell cellReqDateTextBox = new PdfPCell()
{
    CellEvent = new DynamicTextbox("textBoxReqDate", boxRequestDate.Text)
};
tblFirstRow.AddCell(cellReqDateTextBox);
. . .
doc.Add(tblFirstRow);
// "Requester / Payee Signature" stands for all those that fail:
PdfPCell cellTextBoxRequesterPayeeSignature = new PdfPCell()
{
    CellEvent = new DynamicTextbox("textBoxRequesterPayeeSignature", "Enter signature here")
};
tblSection6_Row2.AddCell(cellTextBoxRequesterPayeeSignature);
. . .
doc.Add(tblSection6_Row2);
They both call this:
public class DynamicTextbox : IPdfPCellEvent
{
    private string fieldname;
    private string fieldvalue;

    public DynamicTextbox(string name, string value)
    {
        fieldname = name;
        fieldvalue = value;
    }

    public void CellLayout(PdfPCell cell, Rectangle rectangle, PdfContentByte[] canvases)
    {
        PdfWriter writer = canvases[0].PdfWriter;
        iTextSharp.text.pdf.TextField text = new iTextSharp.text.pdf.TextField(writer, rectangle, fieldname);
        text.Text = fieldvalue;
        PdfFormField field = text.GetTextField();
        writer.AddAnnotation(field);
    }
}
The only difference I can see in how they call "DynamicTextbox()" is that the ones that display pass the Text value of a Textbox as the second arg to DynamicTextbox(), whereas the ones that don't work pass a raw string - why would that matter?
Are the "horizontal lines" below the labels the TextBoxes? If so, why are they of such diminutive height?

All the other text fields belong to a cell in a row that has other content. This other content determines the height of the cell they belong to, and as such also the height of the row.
The text fields that look as if they consist of a single line belong to a row with cells that have no content. These cells are added, but their height is zero. When the cell event is executed, the position parameter is a rectangle with zero height, hence the result that the fields added in such an event consist of nothing more than a line.
To avoid this, you can either define a minimum height or a fixed height. Minimum height means that the height can get a greater value than the value you define, if you add more content. Fixed height means that content that doesn't fit the height you defined won't be shown.
Use:
cellTextBoxRequesterPayeeSignature.FixedHeight = 20f;
or:
cellTextBoxRequesterPayeeSignature.MinimumHeight = 20f;
Adapt the value 20f to whichever value is most appropriate in the context of your application. The measurement unit is user units. The default is that 72 user units equal 1 inch.

How can I store the location of the pixels with colour value using C#?

I am making some lists that hold the colour value (0 for black and 1 for white) for each pixel across my image. My problem is that when I finish each list, it just gives me a one-dimensional array that contains only 0s and 1s, and I don't know which pixel each value belongs to when I want to draw the output image.
Can anyone tell me if I can store the location of the pixels as well as the colour value at the same time in my list? Or is there any other alternative?

Answering the following question:
Can anyone tell me if I can store location of the pixels as well as
the colour value both at the same time in my list? Or any other
alternative?
If using the .NET Framework 4 or higher, you could use a Tuple to store the values. Fill a list of tuples the following way:
var locXLocYColor = new List<Tuple<int, int, bool>>();
locXLocYColor.Add(new Tuple<int, int, bool>(1, 1, true));
You could loop through all these values using a foreach:
int locx, locy;
bool color;
foreach (var itm in locXLocYColor)
{
    locx = itm.Item1;
    locy = itm.Item2;
    color = itm.Item3;
}
More Tuple information: MSDN
The above code can be used when you want to store the pixel locations separately as integers.
Making use of a Dictionary is another way to achieve your needs:
Create a new dictionary:
Dictionary<Point, bool> locationColor = new Dictionary<Point, bool>();
Fill dictionary with location and color:
locationColor.Add(new Point(1, 1), true);
...
Loop over items in dictionary:
Point location;
bool color;
foreach (KeyValuePair<Point, bool> itm in locationColor)
{
    location = itm.Key;
    color = itm.Value;
}
If using Point, don't forget to add using System.Drawing; at the top of your file.
More Point information: MSDN

Dictionary<System.Drawing.Point, int> pixelLocations = new Dictionary<System.Drawing.Point, int>();
Will do what you want.
Edit: Unless you're not storing your locations as points. If they're not Points, then just substitute whatever datatype you're using for System.Drawing.Point.

You can use your list to work out the location of the pixels (this assumes you know the width or height).
If the pixels were stored row by row (the top row first, then the second, etc.), you could get the list index like this (where the origin is 1,1):
mylist[(wantedY - 1) * width + (wantedX - 1)]
Where wantedY is the Y-location and wantedX is the X-location.
If, however, the pixels are arranged by column (so that the pixels of the first column were taken, then the second, etc.), you can just use:
mylist[(wantedX - 1) * height + (wantedY - 1)]
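For illustration, the mapping between a pixel location and its position in a flat list can be sketched as below. PixelIndex is a hypothetical helper written for this example, and it assumes zero-based coordinates and a zero-based list; for coordinates that start at 1,1, subtract 1 from each coordinate first.

```csharp
public static class PixelIndex
{
    // Index of pixel (x, y) when pixels are stored row by row.
    public static int RowMajor(int x, int y, int width)
    {
        return y * width + x;
    }

    // Index of pixel (x, y) when pixels are stored column by column.
    public static int ColumnMajor(int x, int y, int height)
    {
        return x * height + y;
    }

    // Recover (x, y) from an index into a row-by-row list.
    public static (int X, int Y) FromRowMajor(int index, int width)
    {
        return (index % width, index / width);
    }
}
```

Because the index already encodes the location, the list of 0s and 1s does not need a separate location per entry as long as the image width (or height) is kept alongside it.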

Instead of using a list, use a 2-dimensional array
bool[,] isWhite = new bool[bmp.Width, bmp.Height];
and store the values at their corresponding place in this array
isWhite[x, y] = theColor == Color.White;
The location of the pixel is the location within the array. There is no need to store it separately.

ITextSharp, Positioning text

We're using ITextSharp for reasons I do not understand, and we don't have the book yet. I have an annoying problem that I'd appreciate help with.
I'm working with a footer and I cannot get it to align as I want it.
The function I have takes a list of strings; it's generally 4 strings, each of which I want on its own row. It does not seem like ITextSharp can handle strings with line breaks, so that's the reason for the list.
Now this does not position correctly for me. The first string looks OK, but the second string is a bit longer and is half outside the document, as is the third string, and the 4th is not even visible even though there is 1 cm of space left.
Thanks for any help!
public PdfTemplate AddPageText(IList<string> stringList, PdfWriter writer)
{
    var cb = writer.DirectContent;
    PdfTemplate footerTemplate = cb.CreateTemplate(450, 120); //55);
    footerTemplate.BeginText();
    BaseFont bf2 = BaseFont.CreateFont(BaseFont.TIMES_ITALIC, BaseFont.WINANSI, BaseFont.NOT_EMBEDDED);
    footerTemplate.SetFontAndSize(bf2, 9);
    footerTemplate.SetColorStroke(BaseColor.DARK_GRAY);
    footerTemplate.SetColorFill(BaseColor.GRAY);
    footerTemplate.SetLineWidth(3);
    footerTemplate.LineTo(50, footerTemplate.YTLM);
    int y = 10;
    foreach (string text in stringList)
    {
        float widthoftext = 500.0f - bf2.GetWidthPoint(text, 9);
        footerTemplate.ShowTextAligned(PdfContentByte.ALIGN_RIGHT, text, widthoftext, 50 - y, 0);
        y += y;
    }
    footerTemplate.EndText();
    return footerTemplate;
}

If you are doing string placement and using DirectContent then you are responsible for the content. In this case you will need to calculate the string rectangles and wrap accordingly.
I would suggest, however, moving to using a table with cells for the text. Tables wrap text and handle some of the issues you are dealing with.
