I am working on a PDF text extractor with iText7 and am noticing strange text coordinates on a certain PDF. Most documents yield x and y coordinates within the width and height of the page, but one yields negative values. Is there a standard approach to dealing with negative coordinates here? My basic approach is to take positive inch measurements from a PDF and map them to the text and coordinates extracted by iText7, using a scale of 1/72 inch per unit.
I am deriving from LocationTextExtractionStrategy; the code is as follows:
private class LocationTextListStrategy : LocationTextExtractionStrategy
{
private readonly List<TextRect> _textRects = new List<TextRect>();
public List<TextRect> TextRects() => _textRects;
public override void EventOccurred(IEventData data, EventType type)
{
if (!type.Equals(EventType.RENDER_TEXT))
return;
var renderInfo = (TextRenderInfo)data;
var text = renderInfo.GetCharacterRenderInfos();
foreach (var t in text)
{
if (string.IsNullOrWhiteSpace(t.GetText()))
continue;
AddTextRect(t);
}
}
private void AddTextRect(TextRenderInfo t)
{
var letterStart = t.GetBaseline().GetStartPoint();
var letterEnd = t.GetAscentLine().GetEndPoint();
var newTextRect = new TextRect(
text: t.GetText(),
l: letterStart.Get(0),
r: letterEnd.Get(0),
t: letterEnd.Get(1),
b: letterStart.Get(1));
_textRects.Add(newTextRect);
}
}
Each PDF page can have its own, custom coordinate system. It is common to have the origin in the lower left corner of the page but it is not required.
Key | Type | Value
MediaBox | rectangle | (Required; inheritable) A rectangle (see 7.9.5, "Rectangles"), expressed in default user space units, that shall define the boundaries of the physical medium on which the page shall be displayed or printed (see 14.11.2, "Page boundaries").
CropBox | rectangle | (Optional; inheritable) A rectangle, expressed in default user space units, that shall define the visible region of default user space. When the page is displayed or printed, its contents shall be clipped (cropped) to this rectangle (see 14.11.2, "Page boundaries"). Default value: the value of MediaBox.
(ISO 32000-2:2017, Table 31 — Entries in a page)
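Thus, always interpret coordinates with respect to the crop box of the page they refer to. If the lower-left corner of that box is not at (0, 0), for example with a crop box of [-306 -396 306 396], text coordinates can legitimately be negative; subtract the box's lower-left corner to get page-relative values.
The iText 7 class PdfPage has matching getters, GetMediaBox() and GetCropBox().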
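To make the crop-box adjustment concrete, here is a minimal sketch in Python (not iText code). The names Box and to_crop_box_inches and the example crop box are hypothetical. It assumes the page has no rotation and no UserUnit entry, so that one user space unit equals 1/72 inch.

```python
from typing import NamedTuple, Tuple

POINTS_PER_INCH = 72.0  # default user space: 1 unit = 1/72 inch (assumes no UserUnit entry)


class Box(NamedTuple):
    """A page boundary box as (x, y, width, height) in default user space units."""
    x: float
    y: float
    width: float
    height: float


def to_crop_box_inches(x: float, y: float, crop_box: Box) -> Tuple[float, float]:
    """Shift a user-space point so the crop box's lower-left corner becomes the
    origin, then convert the result to inches."""
    return ((x - crop_box.x) / POINTS_PER_INCH,
            (y - crop_box.y) / POINTS_PER_INCH)


# Hypothetical Letter-size page whose crop box is centred on the origin.
crop = Box(-306.0, -396.0, 612.0, 792.0)
corner = to_crop_box_inches(-306.0, -396.0, crop)  # lower-left corner of the page
centre = to_crop_box_inches(0.0, 0.0, crop)        # user-space origin
print(corner, centre)
```

With this crop box, a glyph at user-space (0, 0) sits in the middle of the page, 4.25 inches from the left and 5.5 inches from the bottom.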
I have a PDF that I need to find and replace some text. I know how to create overlays and add text but I can't determine how to locate the current text coordinates. This is the example I found on the bytescout site -
// Create Bytescout.PDFExtractor.TextExtractor instance
TextExtractor extractor = new TextExtractor();
extractor.RegistrationName = "";
extractor.RegistrationKey = "";
/////find text
// Load sample PDF document
extractor.LoadDocumentFromFile(@"myPdf.pdf");
int pageCount = extractor.GetPageCount();
RectangleF location;
for (int i = 0; i < pageCount; i++)
{
// Search each page for string
if (extractor.Find(i, "OPTION 2", false, out location))
{
do
{
Console.WriteLine("Found on page " + i + " at location " + location.ToString());
}
while (extractor.FindNext(out location));
}
}
Console.WriteLine();
Console.WriteLine("Press any key to continue...");
Console.ReadLine();
but it's not working because there isn't an overload of the Find method that takes 4 arguments. I'm not married to using Bytescout to find text coordinates in a PDF, but my company has a license. Is there a license-free way to find text coordinates in a PDF if Bytescout can't accomplish what I'm trying to do?
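Try the three-argument overload. It returns a bool, and the match is then available through the extractor's FoundText property:
if (extractor.Find(i, "OPTION 2", false))
{
var location = extractor.FoundText.Bounds;
}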
(source: https://cdn.bytescout.com/help/BytescoutPDFExtractorSDK/html/M_Bytescout_PDFExtractor_TextExtractor_Find.htm)
The FoundText property returns an ISearchResult:
https://cdn.bytescout.com/help/BytescoutPDFExtractorSDK/html/T_Bytescout_PDFExtractor_ISearchResult.htm
which has these properties:
Public property Bounds: Bounding rectangle of all search result elements. Use Elements or GetElement(Int32) to get bounds of individual elements.
Public property ElementCount: Returns count of individual search result elements.
Public property Elements: Search result elements (individual text objects included into the search result) For COM/ActiveX use GetElement(Int32) instead.
Public property Height: Height of the bounding rectangle of search result. Use Elements or GetElement(Int32) to get bounds of individual elements.
Public property Left: Left coordinate of the bounding rectangle of search result. Use Elements or GetElement(Int32) to get bounds of individual elements.
Public property PageIndex: Index of the page containing the search result.
Public property Text: Text representation of the search result. Use Elements or GetElement(Int32) to get individual elements.
Public property Top: Top coordinate of the bounding rectangle of search result. Use Elements or GetElement(Int32) to get bounds of individual elements.
Public property Width: Width of the bounding rectangle of search result. Use Elements or GetElement(Int32) to get bounds of individual elements.
I have a polygon, consisting of 2D points representing pixel coordinates, in an internal data structure. I need this polygon as an HALCON region (HRegion). The conversion is supposed to happen like that:
HTuple hCols, hRows;
for (auto n = 0; n < nNodes; ++n)
{
auto v2dNode = GetNode(n);
hCols.Append(v2dNode.GetX());
hRows.Append(v2dNode.GetY());
}
HalconCpp::HObject hContour;
HalconCpp::GenContourPolygonXld(&hContour, hRows, hCols);
HalconCpp::HObject hRegion;
HalconCpp::GenRegionContourXld(hContour, &hRegion, "filled");
Whereas the contour (HContour) is valid, according to Halcon Variable Inspect, the created region (hRegion) seems to be empty. HRegion::IsInitialized returns true, but HRegion::AreaCenter would return zero for both area and position, which is clearly wrong.
There are also constructor versions of these function calls (e.g. GenContourPolygonXld) using the "iconic" types HXLDCont and HRegion; they produce an incorrect region as well.
I also tried serializing the contour, saving it to a file and loading it in HDevelop. There, the corresponding code does create a valid region:
open_file('D:/HContour.mvt', 'input_binary', hFile)
fread_serialized_item(hFile, hSer)
deserialize_xld(hContour, hSer)
close_file(hFile)
gen_region_contour_xld(hContour, hRegion, 'filled')
area_center(hRegion, Area, Row, Column)
In C# I also loaded that contour file and tried to create the corresponding region. That approach, too, produced an incorrectly empty region:
HObject hObj;
using (var hFile = new HFile(@"D:\HContour.mvt", "input_binary"))
{
FreadSerializedItem(hFile, out var hSerialized);
DeserializeXld(out hObj, hSerialized);
}
var hContour = new HXLDCont(hObj);
var hRegion = hContour.GenRegionContourXld("filled");
var area = hRegion.AreaCenter(out double row, out var col);
Console.WriteLine($"Area: {area}, Center: {col}|{row}");
The Halcon version is 12.0.3.
Is there a bug in the library, or am I doing it wrong in the C++ and C# code?
Edit:
Before any Halcon code is executed, the following settings are made:
HalconCpp::ResetObjDb(5000, 5000, 1);
HalconCpp::SetSystem("clip_region", "false");
HalconCpp::SetSystem("store_empty_region", "true");
All coordinates are in a valid range, and regions are not clipped.
The contour that has been used for testing is this.
Could it be that your region lies partially outside the predefined region workspace, meaning that some of the pixels have coordinates less than zero?
If that's the case, all you need to do before loading is run this command:
set_system ('clip_region', 'false')
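One way to test that hypothesis before involving HALCON at all is to scan the polygon nodes for out-of-range coordinates. The sketch below is plain Python, not HALCON code. The function name nodes_outside and the 640 x 480 extent are hypothetical; substitute whatever extent your regions are clipped to.

```python
def nodes_outside(nodes, width, height):
    """Return the (x, y) nodes that lie outside [0, width) x [0, height)."""
    return [(x, y) for x, y in nodes
            if not (0 <= x < width and 0 <= y < height)]


# Hypothetical polygon: one node has a negative x, one lies below the extent.
polygon = [(10.0, 20.0), (-3.5, 40.0), (120.0, 700.0)]
bad = nodes_outside(polygon, width=640, height=480)
print(bad)
```

If the list comes back empty, clipping is unlikely to be the cause and the problem lies elsewhere.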
For each word I am creating an object of the LocationTextExtractionStrategy class to get its coordinates. The problem is that each time I pass a word, it returns the coordinates of every chunk in the PDF that contains that word. How can I get the coordinates of the word at a specific position or in a specific line?
I found this code somewhere:
namespace PDFAnnotater
{
public class RectAndText
{
public iTextSharp.text.Rectangle Rect;
public string Text;
public RectAndText(iTextSharp.text.Rectangle rect, string text)
{
this.Rect = rect;
this.Text = text;
}
}
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
public List<RectAndText> myPoints = new List<RectAndText>();
public string TextToSearchFor { get; set; }
public System.Globalization.CompareOptions CompareOptions { get; set; }
public MyLocationTextExtractionStrategy(string textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None)
{
this.TextToSearchFor = textToSearchFor;
this.CompareOptions = compareOptions;
}
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);
//If not found bail
if (startPosition < 0)
{
return;
}
var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();
//Grab the first and last character
var firstChar = chars.First();
var lastChar = chars.Last();
//Get the bounding box for the chunk of text
var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
var topRight = lastChar.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
}
}
}
I am passing words from an array to look up their coordinates. The problem is that the RenderText() method is called again for each chunk, so I get a list of coordinates for every place the word occurs in the PDF. For example, if I need the coordinates of '0', it returns 23 of them. What should I modify in the code to get the exact coordinates of one particular occurrence?
Your question is a bit confusing.
How can I get coordinates of the word present at specific position
In that statement you're basically saying "How can I get the coordinates of something that I already know the coordinates of?" Which is redundant.
I'm going to interpret your question as "How can I get the coordinates of a word, if I know the approximate location?"
I'm not familiar with C#, but I assume there are methods similar to the ones in Java for working with Rectangle objects.
Rectangle#intersects(Rectangle other)
Determines whether or not this Rectangle and the specified Rectangle intersect.
and
Rectangle#contains(Rectangle other)
Tests if the interior of the Shape entirely contains the specified Rectangle2D.
Then the code becomes trivially easy.
You use LocationTextExtractionStrategy to fetch all the iText based rectangles
you convert them to native rectangle objects (or write your own class)
for every rectangle you test whether the given search region contains that rectangle, keeping only those that are within the search region
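The three steps above can be sketched as follows. This is an illustrative Python version, not the iText or java.awt API: Rect, words_in_region and the sample coordinates are all hypothetical, and rectangles use PDF-style coordinates with y increasing upwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    bottom: float
    right: float
    top: float

    def contains(self, other: "Rect") -> bool:
        """True if other lies entirely inside this rectangle (edges included)."""
        return (self.left <= other.left and self.bottom <= other.bottom
                and self.right >= other.right and self.top >= other.top)

    def intersects(self, other: "Rect") -> bool:
        """True if the two rectangles share some interior area."""
        return (self.left < other.right and other.left < self.right
                and self.bottom < other.top and other.bottom < self.top)


def words_in_region(words, region):
    """Keep only the (text, rect) pairs whose rectangle lies fully inside region."""
    return [(text, rect) for text, rect in words if region.contains(rect)]


# Three hypothetical occurrences of "0" on a page.
words = [
    ("0", Rect(72, 700, 78, 710)),
    ("0", Rect(300, 400, 306, 410)),
    ("0", Rect(72, 100, 78, 110)),
]
search_region = Rect(250, 380, 350, 430)
hits = words_in_region(words, search_region)
print(hits)
```

Use intersects instead of contains in the filter if a word that only partly overlaps the search region should still count as a match.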
If you want to implement your second use-case (getting the location of a word if you know the line) then there are two options:
you know the rough coordinates of the line
you want this to work given a line number
For option 1:
build a search region. Use the bounds of the page to get an idea of the width (since the line could stretch over the entire width), and add some margin to the y-coordinates (to account for font differences, subscript and superscript, etc.)
Now that you have a search region, this reverts to my earlier answer.
For option 2:
you already have the y coordinate of every word
round those (to the nearest multiple of fontsize)
build a Map where you keep track of how many times a certain y-coordinate is used
remove any statistical outliers
put all these values in a List
sort the list
This should give you a rough idea of where you can expect a given line(number) to be.
Of course, similar to my earlier explanation, you will need to take into account some padding and some degree of flexibility to get the right answer.
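Option 2 could look roughly like the sketch below. It is a simplification under stated assumptions: the function name, the min_count threshold (a crude stand-in for "remove statistical outliers") and the sample y-values are all hypothetical, and a single font size is assumed for the whole page.

```python
from collections import Counter


def estimate_line_positions(y_coords, font_size, min_count=2):
    """Return candidate line y-values, top of the page first.

    Each y is rounded to the nearest multiple of font_size, the rounded values
    are counted, and values seen fewer than min_count times are dropped.
    """
    buckets = Counter(round(y / font_size) * font_size for y in y_coords)
    kept = [y for y, count in buckets.items() if count >= min_count]
    # PDF y-coordinates grow upwards, so the first line has the largest y.
    return sorted(kept, reverse=True)


# Hypothetical baseline y-values for the words on one page.
ys = [700.2, 699.8, 700.0, 688.1, 687.9, 688.0, 676.0, 675.9, 431.0]
lines = estimate_line_positions(ys, font_size=12)
print(lines)
```

Line number n then corresponds to lines[n - 1]. Build the search region around that y-value with some padding, as described for option 1.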
I've got several textboxes on a PDF form. Most display fine, with data in them where appropriate, such as the "Required Date" textbox. Others display fine when passed no data (such as "Payment Amount"). Those at the bottom, though, do not display at all. e.g., look at the bottom section of the form, from "Requester / Payee Signature" down through the bottom of that ("Authorization") section - only horizontal lines appear below those labels, rather than the TextBoxes that should be there:
The code they use is virtually the same:
// "Request Date" is an example of those which displays fine:
PdfPCell cellReqDateTextBox = new PdfPCell()
{
CellEvent = new DynamicTextbox("textBoxReqDate", boxRequestDate.Text)
};
tblFirstRow.AddCell(cellReqDateTextBox);
. . .
doc.Add(tblFirstRow);
// "Requester / Payee Signature" stands for all those that foul up:
PdfPCell cellTextBoxRequesterPayeeSignature = new PdfPCell()
{
CellEvent = new DynamicTextbox("textBoxRequesterPayeeSignature", "Enter signature here")
};
tblSection6_Row2.AddCell(cellTextBoxRequesterPayeeSignature);
. . .
doc.Add(tblSection6_Row2);
They both call this:
public class DynamicTextbox : IPdfPCellEvent
{
private string fieldname;
private string fieldvalue;
public DynamicTextbox(string name, string value)
{
fieldname = name;
fieldvalue = value;
}
public void CellLayout(PdfPCell cell, Rectangle rectangle, PdfContentByte[] canvases)
{
PdfWriter writer = canvases[0].PdfWriter;
iTextSharp.text.pdf.TextField text = new iTextSharp.text.pdf.TextField(writer, rectangle, fieldname);
text.Text = fieldvalue;
PdfFormField field = text.GetTextField();
writer.AddAnnotation(field);
}
}
The only difference I can see in how they call "DynamicTextbox()" is that the ones that display pass the Text value of a Textbox as the second arg to DynamicTextbox(), whereas the ones that don't work pass a raw string - why would that matter?
Are the "horizontal lines" below the labels the TextBoxes? If so, why are they of such diminutive height?
All the other text fields belong to a cell in a row that has other content. This other content determines the height of the cell they belong to and, as such, also the height of the row.
The text fields that look as if they consist of a single line belong to a row with cells that have no content. These cells are added, but their height is zero. When the cell event is executed, the position parameter is a rectangle with zero height, hence the result that the fields added in such an event consist of nothing more than a line.
To avoid this, you can either define a minimum height or a fixed height. Minimum height means that the height can get a greater value than the value you define, if you add more content. Fixed height means that content that doesn't fit the height you defined won't be shown.
Use:
cellTextBoxRequesterPayeeSignature.FixedHeight = 20f;
or:
cellTextBoxRequesterPayeeSignature.MinimumHeight = 20f;
Adapt the value 20f to whichever value is most appropriate in the context of your application. The measurement unit is user units. The default is that 72 user units equal 1 inch.
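The difference between the two settings can be modelled in a few lines. This is a toy Python model of the behaviour described above, not iText code; resolve_row_height and its parameters are hypothetical names.

```python
POINTS_PER_INCH = 72.0  # default: 72 user units per inch


def resolve_row_height(content_height, minimum_height=0.0, fixed_height=None):
    """Model how a cell's height is chosen, in user units.

    A fixed height always wins, so taller content is cut off.
    A minimum height only applies when the content is shorter.
    """
    if fixed_height is not None:
        return fixed_height
    return max(content_height, minimum_height)


empty_cell = resolve_row_height(0.0)                     # the zero-height case
with_minimum = resolve_row_height(0.0, minimum_height=20.0)
tall_content = resolve_row_height(35.0, minimum_height=20.0)
with_fixed = resolve_row_height(35.0, fixed_height=20.0)
print(empty_cell, with_minimum, tall_content, with_fixed)
print(round(20.0 / POINTS_PER_INCH, 3), "inches")        # 20 user units in inches
```

An empty cell with neither setting resolves to a height of zero, which is why the field collapses to a line.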
Before I start, I must say that for those with a background of linear algebra, this is NOT matrix decomposition as you know it. Please read the following paragraphs to get a clearer understanding of the problem I am trying to solve.
Here are the salient properties/definitions of the matrix and its submatrices:
I have an SxP matrix which forms a grid-like structure of S*P "boxes". This is the main matrix.
This is what the (empty) main matrix looks like. Each square in the matrix is simply referred to as a box. The matrix can be viewed as a kind of "gameboard", e.g. a chess board. The vertical axis is measured using an interval scale (i.e. real numbers), and the horizontal axis is measured using monotonically increasing non-negative integers.
There is an additional concept of submatrices (as explained earlier). A submatrix is simply a collection of boxes in a particular configuration, and with specific numbers and piece types (see black and white pieces below), assigned to the boxes. I have a finite set of these sub matrices - which I refer to as my lexicon or vocabulary for carrying out valid matrix composition/decompositions.
The "formal" definition of a sub matrix is that it is a configuration of M boxes contained within the main matrix, that satisfy the criteria:
1 <= M <= 4
the "gap" G (i.e. distance) between any two adjacent boxes satisfies: 1 <= G <= 2*(vertical units).
A vertical unit is the gap between the vertical axis lines in the main matrix. In the image below, the vertical unit is 100.
The image immediately above illustrates a simple sub matrix addition. The units with orange borders/boxes are sub matrices: the recognized units that form part of my lexicon. You will notice that I have introduced further annotation in my sub matrices. This is because (using the chess analogy) I have two types of pieces I can use on the board. B means a black piece, and W (not shown in the image) represents a white piece. A recognized unit (or lexeme/sub matrix) There is a simple equivalence relation that allows conversion between a white piece and a black piece. This relationship can be used to further decompose a submatrix to use either exclusively black pieces, white pieces or a combination of both.
For the sake of simplicity, I have omitted specifying the equivalence relationship. However, if someone feels that the problem as posed is not "too difficult" without additional details, I shall gladly broaden the scope. For now, I am trying to keep things as simple as possible, to avoid confusing people with "information overload".
Each box in a sub matrix contains a signed integer, indicating a number of units of an item. Each "configuration" of boxes (along with its signed integers and piece type i.e. black or white pieces) is said to be a "recognized unit".
Submatrices can be placed in the main matrix in a way such that they overlap. Wherever the "boxes" overlap, the number of units in the resulting submatrix box is the sum of the number of units in the constituent boxes (as illustrated in the second image above).
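The overlap rule can be illustrated with a small sketch. It is hypothetical throughout: the sparse-dictionary representation, the place function and the two example units are my own choices, and piece types are ignored for brevity.

```python
from collections import defaultdict


def place(board, unit, origin):
    """Add a unit's signed counts onto the board, anchored at origin (row, col)."""
    r0, c0 = origin
    for (dr, dc), units in unit.items():
        board[(r0 + dr, c0 + dc)] += units


board = defaultdict(int)            # sparse main matrix: (row, col) -> units
unit_a = {(0, 0): 2, (1, 0): -1}    # hypothetical recognized unit, two boxes
unit_b = {(0, 0): 3, (0, 1): 1}     # another hypothetical recognized unit

place(board, unit_a, (4, 2))
place(board, unit_b, (5, 2))        # overlaps unit_a at box (5, 2)
print(dict(board))
```

Box (5, 2) ends up holding -1 + 3 = 2 units, the sum of the two overlapping contributions.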
The problem becomes slightly difficult because, the "recognized units" defined above themselves are sometimes combined with other "recognized units" to form another "recognized unit" - i.e. the sub matrices (i.e.recognized units) are "holons". For example, in the second image above, the recognized unit being added to the matrix can itself be further decomposed into "smaller" submatrices.
This sort of holarchy is similar to how (in Physical chemistry), elements form compounds, which then go on to form ever more complicated compounds (amino acids, proteins etc).
Back to our problem, given a main matrix M, I want to be able to do the following:
i. identify the submatrices (or recognized units) that are contained within the main matrix. This is the first "matrix decomposition". (Note: a submatrix has to satisfy the criteria given above)
ii. For each identified submatrix, I want to be able to recognize whether it can be decomposed further into 2 or more recognized submatrices. The idea is to iteratively decompose submatrices found in step i above, until either a specified hierarchy level is reached, or until we have a finite set of submatrices that can not be decomposed further.
I am trying to come up with an algorithm to help me do (i) and (ii) above. I will implement the logic in either C++, Python or C# (in increasing level of preference), depending on which ever is the easiest to do and/or in which I happen to get snippets to get me started in implementing the algorithm.
I am not sure I have understood the problem correctly.
So first you want to find all submatrices that conform to your 2 criteria.
That looks like a graph decomposition problem or a set cover problem, I think, where you can use a recursive function and iterate over the matrix to find all available submatrices.
enum PieceTypes
{
White,
Black
}
class Box
{
public PieceTypes PieceType { get; set; }
public int Units { get; set; } // signed, as the problem statement requires
public int s, p;
public Box(PieceTypes piecetype, int units)
{
PieceType = piecetype;
Units = units;
}
}
class Matrix
{
public Box[,] Boxes;
public int Scale, S, P, MaxNum, MaxDist;
public List<List<Box>> Configurations;
public Matrix(int s, int p, int scale, int maxnum, int maxdist)
{
S = s;
P = p;
Scale = scale;
Boxes = new Box[S, P];
MaxNum = maxnum;
MaxDist = maxdist;
Configurations = new List<List<Box>>();
}
public void Find(List<Box> Config, int s, int p)
{
// Check the max number thats valid for your configuration
// Check that the current p and s are inside matrix
if (Config.Count() < MaxNum && s >= 0 && s < S && p >= 0 && p < P)
{
foreach (Box b in Config)
{
if (Valid(b, Boxes[s, p]))
{
Boxes[s, p].s = s;
Boxes[s, p].p = p;
Config.Add(Boxes[s, p]);
break;
}
}
Find(Config, s + 1, p);
Find(Config, s - 1, p);
Find(Config, s, p + 1);
Find(Config, s, p - 1);
}
if (Config.Count > 0) Configurations.Add(new List<Box>(Config)); // store a copy, since Config is cleared below
Config.Clear();
}
public bool Valid(Box b1, Box b2)
{
// Create your distance function here
// or add extra validation rules such as the PieceType check.
// Note: ^ is XOR in C#, so the squares are written as products.
int ds = b1.s - b2.s, dp = b1.p - b2.p;
return Math.Sqrt(ds * ds + dp * dp) <= MaxDist && b1.PieceType == b2.PieceType;
}
}
I haven't used the best data structures and I have simplified the solution. I hope it's helpful in some way.
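As an alternative starting point in Python, here is a small backtracking sketch for step (i). It is not the asker's algorithm and makes strong simplifying assumptions: the board is a sparse dictionary, every lexicon pattern has (0, 0) as its smallest offset, each placed unit is anchored at the smallest remaining non-zero box, piece types are ignored, and the search is cut off after max_units placements. All names and the sample lexicon are hypothetical.

```python
def decompose(board, lexicon, max_units=8):
    """Return a list of (name, origin) placements whose sum equals board,
    or None if none is found within max_units placements."""
    residual = {pos: n for pos, n in board.items() if n != 0}
    if not residual:
        return []            # nothing left to explain
    if max_units == 0:
        return None          # depth limit reached
    r0, c0 = min(residual)   # smallest remaining non-zero box
    for name, pattern in lexicon.items():
        remaining = dict(residual)
        # Subtract this unit, anchored at (r0, c0), from the residual board.
        for (dr, dc), n in pattern.items():
            pos = (r0 + dr, c0 + dc)
            remaining[pos] = remaining.get(pos, 0) - n
        rest = decompose(remaining, lexicon, max_units - 1)
        if rest is not None:
            return [(name, (r0, c0))] + rest
    return None


lexicon = {
    "A": {(0, 0): 2, (1, 0): -1},
    "B": {(0, 0): 3, (0, 1): 1},
}
board = {(4, 2): 2, (5, 2): 2, (5, 3): 1}
result = decompose(board, lexicon)
print(result)
```

For step (ii), the same function could be applied to a recognized unit's own pattern, using a lexicon that excludes that unit, to see whether it splits into smaller units.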