How to get coordinates of each word of PDF?

For each word, I create a LocationTextExtractionStrategy object to get its coordinates, but every time I pass a word it returns the coordinates of all the chunks of that word in the PDF. How can I get the coordinates of the word at a specific position or on a specific line?
I found this code somewhere:
using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf.parser;

namespace PDFAnnotater
{
    // Pairs a matched piece of text with its bounding rectangle
    public class RectAndText
    {
        public iTextSharp.text.Rectangle Rect;
        public string Text;

        public RectAndText(iTextSharp.text.Rectangle rect, string text)
        {
            this.Rect = rect;
            this.Text = text;
        }
    }

    public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
    {
        // One entry is added for every chunk that contains the search text
        public List<RectAndText> myPoints = new List<RectAndText>();
        public string TextToSearchFor { get; set; }
        public System.Globalization.CompareOptions CompareOptions { get; set; }

        public MyLocationTextExtractionStrategy(string textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None)
        {
            this.TextToSearchFor = textToSearchFor;
            this.CompareOptions = compareOptions;
        }

        // Called by the parser once per text chunk on the page
        public override void RenderText(TextRenderInfo renderInfo)
        {
            base.RenderText(renderInfo);
            var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);
            //If not found bail
            if (startPosition < 0)
            {
                return;
            }
            var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();
            //Grab the first and last character
            var firstChar = chars.First();
            var lastChar = chars.Last();
            //Get the bounding box for the chunk of text
            var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
            var topRight = lastChar.GetAscentLine().GetEndPoint();
            //Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                bottomLeft[Vector.I1],
                bottomLeft[Vector.I2],
                topRight[Vector.I1],
                topRight[Vector.I2]
            );
            this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
        }
    }
}
I pass words from an array to look up their coordinates. The problem is that the RenderText() method is called automatically for every chunk, so it returns a list of coordinates for every place the word occurs in the PDF. For example, if I need the coordinates of '0', it returns 23 coordinates. What should I change in the code to get the exact coordinates of the word?

Your question is a bit confusing.
"How can I get coordinates of the word present at specific position"
In that statement you're basically asking "How can I get the coordinates of something that I already know the coordinates of?", which is redundant.
I'm going to interpret your question as "How can I get the coordinates of a word, if I know the approximate location?"
I'm not familiar with C#, but I assume there are methods similar to the ones in Java for working with Rectangle objects.
Rectangle#intersects(Rectangle other)
Determines whether or not this Rectangle and the specified Rectangle intersect.
and
Rectangle#contains(Rectangle other)
Tests if the interior of the Shape entirely contains the specified Rectangle2D.
Then the code becomes trivially easy.
1. Use LocationTextExtractionStrategy to fetch all the iText-based rectangles.
2. Convert them to native rectangle objects (or write your own class).
3. For every rectangle, test whether the given search region contains it, and keep only those that lie within the search region.
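The three steps above can be sketched in C# roughly as follows. This is only a minimal sketch: BoundingBox, WordHit and RegionFilter are hypothetical helper types written for this example, not part of iTextSharp, and it assumes you have already copied the left/bottom/right/top values out of the iText rectangles.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical plain rectangle type (not an iTextSharp class)
public struct BoundingBox
{
    public readonly float Left, Bottom, Right, Top;

    public BoundingBox(float left, float bottom, float right, float top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    // True when 'other' lies entirely inside this box
    public bool Contains(BoundingBox other)
    {
        return other.Left >= Left && other.Right <= Right
            && other.Bottom >= Bottom && other.Top <= Top;
    }
}

// Hypothetical pairing of a found word with its bounds
public class WordHit
{
    public string Text;
    public BoundingBox Bounds;
}

public static class RegionFilter
{
    // Keep only the hits whose bounds fall completely inside the search region
    public static List<WordHit> WithinRegion(IEnumerable<WordHit> hits, BoundingBox region)
    {
        return hits.Where(h => region.Contains(h.Bounds)).ToList();
    }
}
```

With the 23 hits for '0' from the question, you would pass all of them in together with a region around the spot you care about, and only the occurrence inside that region should come back.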
If you want to implement your second use case (getting the location of a word if you know the line), there are two options:
1. You know the rough coordinates of the line.
2. You want this to work given a line number.
For option 1:
Build a search region. Use the bounds of the page to get an idea of the width (since the line could stretch over the entire width), and add some margin to the y-coordinates (to account for font differences, subscript and superscript, etc.).
Now that you have a search region, this reverts to my earlier answer.
For option 2:
1. You already have the y-coordinate of every word.
2. Round those (to the nearest multiple of the font size).
3. Build a Map where you keep track of how many times a certain y-coordinate is used.
4. Remove any statistical outliers.
5. Put all these values in a List.
6. Sort the list.
This should give you a rough idea of where you can expect a given line (number) to be.
Of course, similar to my earlier explanation, you will need to take into account some padding and some degree of flexibility to get the right answer.
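A rough C# sketch of option 2 might look like this. LineIndex and its parameters are hypothetical names made up for this example. It assumes a single font size for the whole page, and it stands in for "remove statistical outliers" with a simple minimum word count per line, which is a simplification rather than a real outlier test.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineIndex
{
    // Returns the estimated y-coordinate of each text line, top line first.
    // wordYs: the y-coordinate of every word on the page (PDF units).
    // fontSize: assumed uniform font size used as the rounding step.
    // minWordsPerLine: lines with fewer words are treated as outliers and dropped.
    public static List<float> LineBaselines(IEnumerable<float> wordYs, float fontSize, int minWordsPerLine = 1)
    {
        return wordYs
            // Round each y to the nearest multiple of the font size
            .GroupBy(y => (float)Math.Round(y / fontSize) * fontSize)
            // Crude outlier removal: discard sparsely populated lines
            .Where(g => g.Count() >= minWordsPerLine)
            .Select(g => g.Key)
            // y usually grows upwards in PDF, so the first line has the largest y
            .OrderByDescending(y => y)
            .ToList();
    }
}
```

Line number n (1-based) would then sit near LineBaselines(...)[n - 1], and you can build a search region around that y-value with some padding.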

Related

C# iText7 text coordinate extraction question

I am working on a PDF text extractor with iText7 and am noticing strange text coordinates on a certain PDF. Most documents appear to yield x and y coordinates within the height and width of the page, but one seems to yield negatives. I was wondering if there was a standard approach to dealing with negative coordinates here. My basic approach is to take positive inch measurements from a PDF and map them to the text and coordinates extracted by iText7, using a scale of 1/72 inch per dot.
I am deriving from LocationTextExtractionStrategy, and the code is as follows:
private class LocationTextListStrategy : LocationTextExtractionStrategy
{
    private readonly List<TextRect> _textRects = new List<TextRect>();

    public List<TextRect> TextRects() => _textRects;

    public override void EventOccurred(IEventData data, EventType type)
    {
        if (!type.Equals(EventType.RENDER_TEXT))
            return;
        var renderInfo = (TextRenderInfo)data;
        var text = renderInfo.GetCharacterRenderInfos();
        foreach (var t in text)
        {
            if (string.IsNullOrWhiteSpace(t.GetText()))
                continue;
            AddTextRect(t);
        }
    }

    private void AddTextRect(TextRenderInfo t)
    {
        var letterStart = t.GetBaseline().GetStartPoint();
        var letterEnd = t.GetAscentLine().GetEndPoint();
        var newTextRect = new TextRect(
            text: t.GetText(),
            l: letterStart.Get(0),
            r: letterEnd.Get(0),
            t: letterEnd.Get(1),
            b: letterStart.Get(1));
        _textRects.Add(newTextRect);
    }
}

Each PDF page can have its own, custom coordinate system. It is common to have the origin in the lower left corner of the page but it is not required.
Key | Type | Value
MediaBox | rectangle | (Required; inheritable) A rectangle (see 7.9.5, "Rectangles"), expressed in default user space units, that shall define the boundaries of the physical medium on which the page shall be displayed or printed (see 14.11.2, "Page boundaries").
CropBox | rectangle | (Optional; Inheritable) A rectangle, expressed in default user space units, that shall define the visible region of default user space. When the page is displayed or printed, its contents shall be clipped (cropped) to this rectangle (see 14.11.2, "Page boundaries"). Default value: the value of MediaBox.
(ISO 32000-2:2017, Table 31 — Entries in a page)
Thus, always interpret coordinates with respect to the crop box of the page they refer to.
The iText 7 class PdfPage has matching getters (GetMediaBox() and GetCropBox() in the .NET version).
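As an illustration, shifting an extracted point so that it is measured from the lower left corner of the crop box could look like the sketch below. PageSpace and its parameter names are hypothetical, the crop box edges are assumed to have been read from the page beforehand, and the conversion assumes the default user space unit of 1/72 inch and an unrotated page.

```csharp
public static class PageSpace
{
    // Default user space units per inch (assumes no custom unit on the page)
    private const double UnitsPerInch = 72.0;

    // Converts a point in default user space into inches measured
    // from the lower left corner of the crop box.
    // cropLeft / cropBottom: the crop box's lower left corner in user space.
    public static (double X, double Y) ToInchesFromCropOrigin(
        double x, double y, double cropLeft, double cropBottom)
    {
        return ((x - cropLeft) / UnitsPerInch, (y - cropBottom) / UnitsPerInch);
    }
}
```

For a page whose crop box is centred on the origin, say from (-306, -396) to (306, 396), a point extracted at (-234, -324) would map to one inch right of and one inch above the visible lower left corner, even though both raw coordinates are negative.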

Ifc2x3 Equivalent for IfcExtrudedAreaSolidTapered

I want to be able to implement a truncated cone in IFC. I know that there is a rather quick way to implement this in IFC 2x4 with the IfcExtrudedAreaSolidTapered class.
Can anybody tell me how to do that with Ifc 2x3?
Here's what I have:
IfcExtrudedAreaSolid CreateExtrudedAreaSolid(IfcStore model, IfcProfileDef profile, IfcAxis2Placement3D placement, double extrude)
{
    var extrusion = model.Instances.New<IfcExtrudedAreaSolid>();
    extrusion.Depth = extrude;
    extrusion.ExtrudedDirection = model.Instances.New<IfcDirection>(d => d.SetXYZ(0, 0, 1));
    extrusion.Position = placement;
    extrusion.SweptArea = profile;
    return extrusion;
}
And here's where I create the profile:
private IfcCircleHollowProfileDef MakeCircleHollowProfileDef(IfcStore model, IfcAxis2Placement3D placement, double r, double wallThickness)
{
    var circleProfile = model.Instances.New<IfcCircleHollowProfileDef>();
    circleProfile.Position = ConvertToAxis2D(placement, model);
    circleProfile.Radius = r;
    circleProfile.WallThickness = wallThickness;
    return circleProfile;
}
Does anybody have an idea how to do that the right way?

I would go for a cone and cut it (via BooleanResult) with a half space. You want the boolean operation to be DIFFERENCE, with the cone as the first operand and the half space as the second operand.
I don't have the code to implement that in xBim (I use IfcPlusPlus), sorry. Judging from your code, one piece of information you will need to calculate is the full height of the cone, so that you can cut it back to the desired height.
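The full cone height follows from similar triangles: for a truncated cone with base radius r1, top radius r2 and height h, the uncut cone has height H = h * r1 / (r1 - r2). A small C# sketch of just that calculation is below. ConeMath is a hypothetical helper and deliberately contains no xBim calls, since I can't vouch for that API.

```csharp
using System;

public static class ConeMath
{
    // Height of the full (uncut) cone that, when cut at frustumHeight,
    // leaves a truncated cone with the given base and top radii.
    public static double FullConeHeight(double baseRadius, double topRadius, double frustumHeight)
    {
        if (topRadius < 0 || topRadius >= baseRadius)
            throw new ArgumentOutOfRangeException(nameof(topRadius),
                "Top radius must be non-negative and smaller than the base radius.");
        // Similar triangles: topRadius / baseRadius = (H - h) / H
        return frustumHeight * baseRadius / (baseRadius - topRadius);
    }
}
```

For example, a base radius of 2, a top radius of 1 and a desired height of 3 give a full cone height of 6. The cutting half space would then sit at the desired height above the base.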

Given a collection of Locations, how can I determine the appropriate Zoom level and Map Center for Bing Maps?

I'm working on a Windows Store App (C#) using Bing Maps.
Given a collection of Locations (latitude and longitude pairs), I want to be able to determine what the zoom level for the map should be and what its center point (Location) should be.
From the collection of Location values, I extract the four "extreme" cardinal points that need to be displayed (furthest north, south, east, and west).
IOW, if I want to display pushpins throughout San Francisco, I want to get the zoom level to show just that city and nothing more. If I want to display pushpins scattered across the U.S., ... you get the picture.
This is what I have so far (just a rough draft/pseudocode, as you can see):
Determine the extreme cardinal values of a set of Locations (code not shown; should be trivial). Create an instance of my custom class:
public class GeoSpatialBoundaries
{
    public double furthestNorth { get; set; }
    public double furthestSouth { get; set; }
    public double furthestWest { get; set; }
    public double furthestEast { get; set; }
}
...then call these methods, passing that instance:
// This seems easy enough, but perhaps my solution is over-simplistic
public static Location GetMapCenter(GeoSpatialBoundaries gsb)
{
    double lat = (gsb.furthestNorth + gsb.furthestSouth) / 2;
    double lon = (gsb.furthestWest + gsb.furthestEast) / 2;
    return new Location(lat, lon);
}

// This math may be off; just showing my general approach
public static int GetZoomLevel(GeoSpatialBoundaries gsb)
{
    double latitudeRange = gsb.furthestNorth - gsb.furthestSouth;
    double longitudeRange = gsb.furthestEast - gsb.furthestWest;
    int latZoom = GetZoomForLat(latitudeRange);
    int longZoom = GetZoomForLong(longitudeRange);
    return Math.Max(latZoom, longZoom);
}
Here's where I really get lost, though. How do I determine the zoom level to return (between 1..20) based on these vals? Here's a very rough idea (GetZoomForLat() is basically the same):
// Bing Zoom levels range from 1 (the whole earth) to 20 (the tippy-top of the cat's whiskers)
private static int GetZoomForLong(double longitudeRange)
{
    // TODO: What Zoom level ranges should I set up as the cutoff points? IOW, should it be something like:
    if (longitudeRange > 340) return 1;
    else if (longitudeRange > 300) return 2;
    // etc.? What should the cutoff points be?
    else return 1;
}
Does anyone have any suggestions or links that can point me to how to implement these functions?

I wrote a blog post on how to do this a while ago here: http://rbrundritt.wordpress.com/2009/07/21/determining-best-map-view-for-an-array-of-locations/
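If you want to compute the zoom level by hand, one common approach (not necessarily the one used in the linked post) relies on the Web Mercator tile scheme, where the whole world is 256 * 2^zoom pixels wide at a given zoom level. The sketch below is written under that assumption. MapViewMath, mapWidthPx and mapHeightPx are hypothetical names, and it does not handle bounding boxes that cross the 180° meridian.

```csharp
using System;

public static class MapViewMath
{
    // Maps a latitude to its vertical position on the Mercator map,
    // as a fraction from 0 (top) to 1 (bottom).
    private static double LatToMercatorFraction(double latDeg)
    {
        // Mercator is only defined up to about +/-85.05 degrees
        double lat = Math.Max(-85.05112878, Math.Min(85.05112878, latDeg));
        double sin = Math.Sin(lat * Math.PI / 180.0);
        return 0.5 - Math.Log((1 + sin) / (1 - sin)) / (4 * Math.PI);
    }

    // Largest zoom level (clamped to 1..20) at which the box still fits the map control
    public static int GetZoomLevel(double north, double south, double west, double east,
                                   double mapWidthPx, double mapHeightPx)
    {
        double lonFraction = (east - west) / 360.0;
        double latFraction = LatToMercatorFraction(south) - LatToMercatorFraction(north);

        // Solve fraction * 256 * 2^zoom <= available pixels for zoom
        double zoomX = lonFraction > 0 ? Math.Log(mapWidthPx / 256.0 / lonFraction, 2) : 20;
        double zoomY = latFraction > 0 ? Math.Log(mapHeightPx / 256.0 / latFraction, 2) : 20;

        // Take the smaller value so that both extents fit on screen
        int zoom = (int)Math.Floor(Math.Min(zoomX, zoomY));
        return Math.Max(1, Math.Min(20, zoom));
    }
}
```

Note that this takes the minimum of the two candidate zoom levels: the more zoomed-out one is the one that guarantees every pushpin is visible. You may also want to subtract a little from the map size to leave a margin around the outermost pins.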

You can use the LocationRect class to set the bounding box, see the MSDN:
http://msdn.microsoft.com/en-us/library/hh846491.aspx
Then you use the Map class and its SetView() method, see the MSDN:
http://msdn.microsoft.com/en-us/library/hh846504.aspx
Here is some code that would work (where map is your map control instance):
var collection = new LocationCollection();
collection.Add(new Location(47.5, 2.75));
collection.Add(new Location(48.5, 2.75));
collection.Add(new Location(43.5, 5.75));
map.SetView(new LocationRect(collection));
So you can add the coordinates of each element that you want to display on your map to the collection, generate the bounding box from it, and set the view accordingly.

How to share one value among multiple objects? C#

Is it possible to have some variable shared between some objects of the same class such that, when the value is changed in one object, it will also change in the other object? Static variables would not work in this case, because there could be 2 objects that all have some related variable and another 2 objects that have a different related variable.
For example, say I have 4 squares that are arranged to make one large square, and the squares coordinates lie in an x,y,z plane. When the 4 squares are together, they would all have a point that lies in the center of the biggest square.
Pretend this square also has a z coordinate. Now, the squares will all share the point that lies in the center. The top left square's bottom right corner, the top right square's bottom left corner, etc., will all have the same (x, y, z) value.
What I want is that if the z value changes in one square, it changes in all of them without any extra code, as if they all pointed to the same memory location, so they update "automatically" in a sense.
Is something like this possible?

Here is one way you could do this. Each square's Data field points to the same SharedData object.
class Square
{
    private SharedData Data;

    public Square(SharedData data)
    {
        this.Data = data;
    }
}

class SharedData
{
    public double Z { get; set; }
}

// All four squares hold a reference to the same SharedData instance
SharedData data = new SharedData() { Z = 100.0 };
Square topLeft = new Square(data);
Square topRight = new Square(data);
Square bottomLeft = new Square(data);
Square bottomRight = new Square(data);
You can put SharedData behind an interface that provides read-only access to the squares if you wish. If the squares are not supposed to modify the value of Z this would be a safer approach.
interface IReadOnlyData
{
    double GetZ();
}

class SharedData : IReadOnlyData
{
    public double Z { get; set; }

    // Explicit interface implementation: squares can read Z but not set it
    double IReadOnlyData.GetZ() { return Z; }
}

class Square
{
    private IReadOnlyData Data;

    public Square(IReadOnlyData data)
    {
        this.Data = data;
    }
}

I think you are thinking about this the wrong way. If all of the squares are to be treated as logically one object, make one object that encapsulates all of them and provides the logic you want. You shouldn't have objects sneakily mutating other objects' values behind their backs, IMO.

Implement the observer pattern.
Check this post:
Super-simple example of C# observer/observable with delegates
Hope it helps you.
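A minimal sketch of that idea with a plain C# event is shown below. SharedPoint and ObservingSquare are hypothetical names made up for this example. Each square keeps its own copy of the z value and updates it whenever the shared point raises its change event.

```csharp
using System;

// The subject: owns the value and notifies subscribers when it changes
public class SharedPoint
{
    private double _z;

    public event Action<double> ZChanged;

    public double Z
    {
        get { return _z; }
        set
        {
            _z = value;
            // Notify every subscribed square (no-op if nobody subscribed)
            ZChanged?.Invoke(value);
        }
    }
}

// The observer: keeps its own corner value in sync with the shared point
public class ObservingSquare
{
    public double CornerZ { get; private set; }

    public ObservingSquare(SharedPoint center)
    {
        CornerZ = center.Z;
        center.ZChanged += z => CornerZ = z;
    }
}
```

Setting Z on the shared point once then updates CornerZ on every square that was constructed with it. Compared with sharing a reference directly, this costs a bit more code but lets each square react to the change, for example by redrawing itself.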

"Matrix decomposition" of a matrix with holonic sub structure

Before I start, I must say that for those with a background of linear algebra, this is NOT matrix decomposition as you know it. Please read the following paragraphs to get a clearer understanding of the problem I am trying to solve.
Here are the salient properties/definitions of the matrix and its submatrices:
I have an SxP matrix which forms a grid-like structure of S*P "boxes". This is the main matrix.
This is what the (empty) main matrix looks like. Each square in the matrix is simply referred to as a box. The matrix can be viewed as a kind of "gameboard", e.g. a chess board. The vertical axis is measured using an interval scale (i.e. real numbers), and the horizontal axis is measured using monotonically increasing non-negative integers.
There is an additional concept of submatrices (as explained earlier). A submatrix is simply a collection of boxes in a particular configuration, and with specific numbers and piece types (see black and white pieces below), assigned to the boxes. I have a finite set of these sub matrices - which I refer to as my lexicon or vocabulary for carrying out valid matrix composition/decompositions.
The "formal" definition of a sub matrix is that it is a configuration of M boxes contained within the main matrix, that satisfy the criteria:
1 <= M <= 4
the "gap" G (i.e. distance) between any two adjacent boxes satisfies: 1 <= G <= 2*(vertical units).
A vertical unit is the gap between the vertical axis lines in the main matrix. In the image below, the vertical unit is 100.
The image immediately above illustrates a simple sub matrix addition. The units with orange borders/boxes are sub matrices - the recognized units that form part of my lexicon. You will notice that I have introduced further annotation in my sub matrices. This is because (using the chess analogy), I have two types of pieces I can use on the board. B means a black piece, and W (not shown in the image), represents a white piece. A recognized unit (or lexeme/sub matrix) There is a simple equivalence relation that allows conversion between a white piece and a black piece. This relationship can be used to further decompose a submatrix to use either exclusively black pieces, white pieces or a combination of both.
For the sake of simplicity, I have omitted specifying the equivalence relationship. However, if someone feels that the problem as posed is not "too difficult" without additional details, I shall gladly broaden the scope. For now, I am trying to keep things as simple as possible, to avoid confusing people with "information overload".
Each box in a sub matrix contains a signed integer, indicating a number of units of an item. Each "configuration" of boxes (along with its signed integers and piece type i.e. black or white pieces) is said to be a "recognized unit".
Submatrices can be placed in the main matrix in a way such that they overlap. Wherever the "boxes" overlap, the number of units in the resulting submatrix box is the sum of the number of units in the constituent boxes (as illustrated in the second image above).
The problem becomes slightly difficult because, the "recognized units" defined above themselves are sometimes combined with other "recognized units" to form another "recognized unit" - i.e. the sub matrices (i.e.recognized units) are "holons". For example, in the second image above, the recognized unit being added to the matrix can itself be further decomposed into "smaller" submatrices.
This sort of holarchy is similar to how (in Physical chemistry), elements form compounds, which then go on to form ever more complicated compounds (amino acids, proteins etc).
Back to our problem, given a main matrix M, I want to be able to do the following:
i. identify the submatrices (or recognized units) that are contained within the main matrix. This is the first "matrix decomposition". (Note: a submatrix has to satisfy the criteria given above)
ii. For each identified submatrix, I want to be able to recognize whether it can be decomposed further into 2 or more recognized submatrices. The idea is to iteratively decompose submatrices found in step i above, until either a specified hierarchy level is reached, or until we have a finite set of submatrices that can not be decomposed further.
I am trying to come up with an algorithm to help me do (i) and (ii) above. I will implement the logic in either C++, Python or C# (in increasing order of preference), depending on whichever is the easiest to do and/or for which I happen to find snippets to get me started on implementing the algorithm.

I am not sure I have understood the problem correctly.
So first you want to find all submatrices that conform to your two criteria.
That's like a graph decomposition problem or a set cover problem, I think, where you can have a recursive function and iterate over the matrix to find all available submatrices.
using System;
using System.Collections.Generic;

enum PieceTypes
{
    White,
    Black
}

class Box
{
    public PieceTypes PieceType { get; set; }
    // Signed, because the question says each box holds a signed integer
    public int Units { get; set; }
    public int s, p;

    public Box(PieceTypes piecetype, int units)
    {
        PieceType = piecetype;
        Units = units;
    }
}

class Matrix
{
    public Box[,] Boxes;
    public int Scale, S, P, MaxNum, MaxDist;
    public List<List<Box>> Configurations;

    public Matrix(int s, int p, int scale, int maxnum, int maxdist)
    {
        S = s;
        P = p;
        Scale = scale;
        Boxes = new Box[S, P];
        MaxNum = maxnum;
        MaxDist = maxdist;
        Configurations = new List<List<Box>>();
    }

    public void Find(List<Box> Config, int s, int p)
    {
        // Check the max number that's valid for your configuration
        // Check that the current p and s are inside the matrix
        // NOTE: this simplified recursion does not track visited cells;
        // add a visited set before running it on real data
        if (Config.Count < MaxNum && s >= 0 && s < S && p >= 0 && p < P)
        {
            foreach (Box b in Config)
            {
                if (Valid(b, Boxes[s, p]))
                {
                    Boxes[s, p].s = s;
                    Boxes[s, p].p = p;
                    Config.Add(Boxes[s, p]);
                    break;
                }
            }
            Find(Config, s + 1, p);
            Find(Config, s - 1, p);
            Find(Config, s, p + 1);
            Find(Config, s, p - 1);
        }
        // Store a copy, because Config is cleared right afterwards
        if (Config.Count > 0) Configurations.Add(new List<Box>(Config));
        Config.Clear();
    }

    public bool Valid(Box b1, Box b2)
    {
        // Create your distance function here
        // or add your extra validation rules like the PieceType
        // Note: ^ is XOR in C#, so square by multiplying instead
        int ds = b1.s - b2.s;
        int dp = b1.p - b2.p;
        return Math.Sqrt(ds * ds + dp * dp) <= MaxDist && b1.PieceType == b2.PieceType;
    }
}
I haven't used the best data structures and I have simplified the solution. I hope it's helpful in some way.
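For step (ii) of the question, the overlap rule (units in overlapping boxes are summed) can be sketched separately from the search. The code below is only a rough illustration under several assumptions: UnitAlgebra is a hypothetical helper, a unit is modelled as a dictionary from absolute (s, p) positions to signed unit counts, piece types are ignored, and the lexicon entries are assumed to be already placed at the positions being tested.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class UnitAlgebra
{
    // Overlays two units; where boxes overlap, their unit counts are added
    public static Dictionary<(int S, int P), int> Sum(
        Dictionary<(int S, int P), int> a,
        Dictionary<(int S, int P), int> b)
    {
        var result = new Dictionary<(int S, int P), int>(a);
        foreach (var kv in b)
        {
            // 'existing' stays 0 when the box is not present yet
            result.TryGetValue(kv.Key, out int existing);
            result[kv.Key] = existing + kv.Value;
        }
        return result;
    }

    // Two units are equal when they cover the same boxes with the same counts
    public static bool SameUnit(
        Dictionary<(int S, int P), int> a,
        Dictionary<(int S, int P), int> b)
    {
        return a.Count == b.Count
            && a.All(kv => b.TryGetValue(kv.Key, out int v) && v == kv.Value);
    }

    // True when some pair of lexicon units sums to the target unit
    public static bool CanSplitInTwo(
        Dictionary<(int S, int P), int> target,
        IList<Dictionary<(int S, int P), int>> lexicon)
    {
        for (int i = 0; i < lexicon.Count; i++)
            for (int j = i; j < lexicon.Count; j++)
                if (SameUnit(Sum(lexicon[i], lexicon[j]), target))
                    return true;
        return false;
    }
}
```

Applying CanSplitInTwo recursively to each part would give the iterative decomposition, stopping when no pair matches or the desired hierarchy level is reached. Trying every placement offset of each lexicon unit is left out here to keep the sketch short.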
