I've already tried asking the question on their forums but as yet to have received a response, hope someone can help me on here.
I have setup a custom report screen in asp.net where people can drag labels and fields and Migradoc produces this accordingly by using textframes and top/left/width/height properties to lay them out in the same place they were dragged/resized to. This all works great however one issue I have is if the text is longer than the textframe it runs off the page and I need the page to move on accordingly whilst retaining the other objects in place.
I can use the below code to measure a string:
Style style = document.Styles["Normal"];
TextMeasurement tm = new TextMeasurement(style.Font.Clone());
float fh = tm.MeasureString(value, UnitType.Millimeter).Height;
float fw = tm.MeasureString(value, UnitType.Millimeter).Width;
But it's only useful for comparing the width against the frame and not the height because it could be different once put into a smaller area. Does anyone know how I can measure this string based on bound width/height values i.e. within a text frame.
Look at the CreateBlocks() method in the XTextFormatter class and how it calls MeasureString in a loop to break the text to multiple lines.
I'm afraid you have to implement such a loop yourself.
Or maybe use the PrepareDocument() method of the DocumentRenderer class to let MigraDoc do the work and just query the dimensions when it's done.
BTW: a similar question had been asked at the forum before:
http://forum.pdfsharp.net/viewtopic.php?p=3590#p3590
Answer includes some source code.
An easy way to do this (using I-liked-the-old-stack-overflow's link) is to add the PdfWordWrapper class to your project and then calculate the dimensions of your text as follows:
var wrapper = new PdfWordWrapper(g, contentWidth); //g is your XGraphics object
wrapper.Add("My text here", someFont, XBrushes.Black);
wrapper.Process();
var dimensions = wrapper.Size; //you can access .Height or .Width
//If you want to reuse the wrapper just call .Clear() and then .Add() again with some new text
Related
I was able to successfully use the following code to highlight text in an existing PDF:
private static void highlightDiff(PdfStamper stamper, Rectangle rectangle, int page)
{
float[] quadPoints = { rectangle.Left, rectangle.Bottom, rectangle.Right, rectangle.Bottom, rectangle.Left, rectangle.Top, retangle.Right, rectangle.Top };
PdfAnnotation highlight = PdfAnnotation.CreateMarkup(stamper.Writer, rectangle, null, PdfAnnotation.MARKUP_HIGHLIGHT, quadPoints);
highlight.Color = BaseColor.RED;
stamper.AddAnnotation(highlight, page);
}
The problem is I'm highlighting characters at a time and my guess is a new layer is added every time I call this function because the resulting file size is significantly larger after the program has completed running.
I tried to add the following lines at the end of the function and maybe it's just me but it seemed to have sped up the time it takes the PDF to load when I go to view it but the size of the file still remains exceedingly large.
stamper.FreeTextFlattening = true;
I may try to make my code more efficient and decrease the number of calls I make (if the characters I'm highlighting are next to each other, find the combined rectangle and call) but was wondering if there was another way around this. Thanks in advance!
Each time you execute highlightDiff you add a new highlight annotation to the PDF. Inside the PDF such an annotation is an object like this:
1 0 obj
<<
/Rect[204.68 705.11 211.2 716.11]
/Subtype/Highlight
/Contents()
/QuadPoints[204.68 716.11 211.2 716.11 204.68 705.11 211.2 705.11]
/C[1 0 0]
/P 2 0 R
>>
Furthermore there needs to be a reference to this object from the page description plus an entry in the internal cross references.
Thus, each such call makes the PDF grow by nearly 200 bytes. If you highlight many such individual characters, the file indeed will grow considerably.
I may try to make my code more efficient and decrease the number of calls I make (if the characters I'm highlighting are next to each other, find the combined rectangle and call) but was wondering if there was another way around this.
If you indeed want your highlighting to be done using highlighting annotations, there is not way around this.
If you on the other hand would also accept highlighting rectangles to be drawn in the regular page content, you may see less file size growth using that approach. Even then, though, first combining neighboring rectangles would reduce file size (and PDF viewer resource requirements) considerably.
See attached K-1 Document. I have attempted to use numerous tweaks with iTextSharp library but haven't had success in loading data correctly.
Ideally I would like to parse out the document similar to how humans would read them, one textbox at a time, reading its contents.
var reader = new PdfReader(FILE, Encoding.ASCII.GetBytes(password));
string[] lines;
var strategy = new LocationTextExtractionStrategy();
string currentPageText = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
lines = currentPageText.Split(new string[] {"\r\n", "\n"}, StringSplitOptions.None);
I also tried playing with Annotation parsing but didn't have luck.
I'm a newbie and probably looking at wrong place. Can you help guide me in the right direction?
Thanks a lot.
You would like to parse out the document similar to how humans would read them, one textbox at a time, reading its contents. That means you first will have to try and automatically recognize those text boxes. Then you can extract text by these areas.
To recognize those text boxes automatically in your document, you have to extract the border lines enclosing the boxes. For this you will first have to find out how those border lines are created. They might be drawn using vector graphics as lines or rectangles, but they could also be part of a background bitmap image.
Unfortunately I don't have your IRS form at hand and so cannot analyze its internals. Let's assume the borders are created using vector graphics for now. Thus, you have to extract vector graphics.
To extract vector graphics with iText(Sharp), you make use of classes from the iText(Sharp) parser namespace by making them parse the document and feed the parsing events into a listener you create which collects the vector graphic operations:
You implement IExtRenderListener, in particular its ModifyPath and RenderPath methods which respectively are called when additional path elements (e.g. lines or rectangles) are added to the current path or when the current path is rendered (stroked? filled?). Your implementation collects these information.
You parse your document into an instance of your listener, e.g. using PdfReaderContentParser.
You analyse the lines and rectangles found and derive the coordinates of the boxes they build.
You parse the same page in a LocationTextExtractionStrategy instance.
You retrieve the texts of the recognized text boxes by calling LocationTextExtractionStrategy.GetResultantText with a matching ITextChunkFilter argument for each box.
(Actually you can do the parsing into the instance of your listener and the LocationTextExtractionStrategy instance in one pass for a bit of optimization.)
All iText(Sharp) specific tasks are trivial, and the only other task, the analysis of the lines and rectangles found to derive the coordinates of the boxes, should be no big problem for a software developer proficient in C#.
The first question if this form is electronic or a scanned one? the latter would make the data extraction much harder as it should involve OCR too.
in case you have electronic PDF and if you have all the similar forms then why don't you just use the following strategy:
store coordinates of each "box" in the config file
process documents and exract text from every "box" (i.e. region)
additional process extracted text with regular expressions to separate name from address (or maybe you may just set the region to read text from line by line)
In case you have few variations of the form then you may check the very first box to extract the name of the form and load the appropraite settings file (that contains a set of regions for that variation)
This approach should work with any PDF library.
Take a look at IvyPdf library and template editor. It's using c# and provides high-level functions to parse and extract data so you don't have to deal with internals of PDF documents. You can build fairly complex scenarios using it.
I don't think it can read annotations though.
I have written an extraction tool using iTextSharp that extracts annotation information from PDF documents. For the highlight annotation, I only get a rectangle for the area on the page which is highlighted.
I am aiming for extracting the text that has been highlighted. For that I use `PdfTextExtractor'.
Rectangle rect = new Rectangle(
pdfArray.GetAsNumber(0).FloatValue,
pdfArray.GetAsNumber(1).FloatValue,
pdfArray.GetAsNumber(2).FloatValue,
pdfArray.GetAsNumber(3).FloatValue);
RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
string textInsideRect = PdfTextExtractor.GetTextFromPage(pdfReader, pageNo, strategy);
return textInsideRect;
The result returned by PdfTextExtractor is not entirely correct. For instance it returns "was going to eliminate the paper chase" even though only "eliminate" was highlighted.
Interesting enough the entire text for the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).
I would love to hear any input regarding this issue - also solutions that doesn't involve iTextSharp.
The cause
Interesting enough the entire text for the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).
This actually is the reason for your issue. The iText parser classes forward the text to the render listeners in the pieces they find as continuous strings in the content stream. The filter mechanism you use filters these pieces. Thus, that whole sentence is accepted by the filter.
What you need, therefore, is some pre-processing step which splits these pieces into their individual characters and forwards these individually to your filtered render listener.
This actually is fairly easy to implement. The argument type in which the text pieces are forwarded, TextRenderInfo, offers a method to split itself up:
/**
* Provides detail useful if a listener needs access to the position of each individual glyph in the text render operation
* #return A list of {#link TextRenderInfo} objects that represent each glyph used in the draw operation. The next effect is if there was a separate Tj opertion for each character in the rendered string
* #since 5.3.3
*/
public List<TextRenderInfo> getCharacterRenderInfos() // iText / Java
virtual public List<TextRenderInfo> GetCharacterRenderInfos() // iTextSharp / .Net
Thus, all you have to do is create and use a RenderListener / IRenderListener implementation which forwards all the calls it gets to another listener (your filtered listener in your case) with the twist that renderText / RenderText splits its TextRenderInfo argument and forwards the splinters one by one individually.
A Java sample
As the OP asked for more details, here some more code. As I'm predominantly working with Java, though, I'm providing it in Java for iText. But it is easy to port to C# for iTextSharp.
As mentioned above a pre-processing step is needed which splits the text pieces into their individual characters and forwards them individually to your filtered render listener.
For this step you can use this class TextRenderInfoSplitter:
package stackoverflow.itext.extraction;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
public class TextRenderInfoSplitter implements TextExtractionStrategy
{
public TextRenderInfoSplitter(TextExtractionStrategy strategy)
{
this.strategy = strategy;
}
public void renderText(TextRenderInfo renderInfo)
{
for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
{
strategy.renderText(info);
}
}
public void beginTextBlock()
{
strategy.beginTextBlock();
}
public void endTextBlock()
{
strategy.endTextBlock();
}
public void renderImage(ImageRenderInfo renderInfo)
{
strategy.renderImage(renderInfo);
}
public String getResultantText()
{
return strategy.getResultantText();
}
final TextExtractionStrategy strategy;
}
If you have a TextExtractionStrategy strategy (like your new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter)), you now can feed it with single-character TextRenderInfo instances like this:
String textInsideRect = PdfTextExtractor.getTextFromPage(reader, pageNo, new TextRenderInfoSplitter(strategy));
I tested it with the PDF created in this answer for the area
Rectangle rect = new Rectangle(200, 600, 200, 135);
For reference I marked the area in the PDF:
Text extraction filtered by area without the TextRenderInfoSplitter results in:
I am trying to create a PDF file with a lot
of text contents in the document. I am
using PDFBox
Text extraction filtered by area with the TextRenderInfoSplitter results in:
to create a PDF f
ntents in the docu
n g P D F
BTW, you here see a disadvantage of splitting the text into individual characters early: The final text line is typeset using very large character spacing. If you keep the text segments from the PDF as they are, text extraction strategies still easily can see that the line consists of the two words using and PDFBox. As soon as you feed the text segments character by character into the text extraction strategies, they are likely to interpret such widely set words as many one-letter words.
An improvement
The highlighted word "eliminate" is for instance extracted as "o eliminate t". This has been highlighted by double clicking the word and highlighted in Adobe Acrobat Reader.
Something similar happens in my sample above, letters barely touching the area of interest make it into the result.
This is due to the RegionTextRenderFilter implementation of allowText allowing all text to continue whose baseline intersects the rectangle in question, even if the intersection consists of merely a single dot:
public boolean allowText(TextRenderInfo renderInfo){
LineSegment segment = renderInfo.getBaseline();
Vector startPoint = segment.getStartPoint();
Vector endPoint = segment.getEndPoint();
float x1 = startPoint.get(Vector.I1);
float y1 = startPoint.get(Vector.I2);
float x2 = endPoint.get(Vector.I1);
float y2 = endPoint.get(Vector.I2);
return filterRect.intersectsLine(x1, y1, x2, y2);
}
Given that you first split the text into characters, you might want to check whether their respective base line is completely contained in the area in question, i.e. implement an own
RenderFilter by copying RegionTextRenderFilter and then replacing the line
return filterRect.intersectsLine(x1, y1, x2, y2);
by
return filterRect.contains(x1, y1) && filterRect.contains(x2, y2);
Depending on how exactly exactly text is highlighted in Adobe Acrobat Reader, though, you might want to change this in a completely custom way.
Highlight annotations are represented a collection of quadrilaterals that represent the area(s) on the page surrounded by the annotation in the /QuadPoints entry in the dictionary.
Why are they this way?
This is my fault, actually. In Acrobat 1.0, I worked on the "find text" code which initially only used a rectangle for the representation of a selected area on the page. While working on the code, I was very unhappy with the results, especially with maps where the text followed land details.
As a result, I made the find tool build up a set of quadrilaterals on the page and anneal them, when possible, to build words.
In Acrobat 2.0, the engineer responsible for full generalized text extraction built an algorithm called Wordy that was better than my first cut, but he kept the quadrilateral code since that was the most accurate representation of what was on the page.
Almost all text-related code was refactored to use this code.
Then we get highlight annotations. When markup annotations were added to Acrobat, they were used to decorate text that was already on the page. When a user clicks down on a page, Wordy extracts the text into appropriate data structures and then the text select tool maps mouse motion onto the quadrilateral sets. When a text highlight annotation is created, the subset of quadrilaterals from Wordy get placed into a new text highlight annotation.
How do you get the words on the page that are highlighted. Tricky. You have to extract the text on the page (you don't have Wordy, sorry) and then find all quads that are contained within the set from the annotation.
Is there any way to make lines between points, given a simple geometry as line style, using WPF geometries? I know it is possible to make these kind of lines:
-- -- --- --
But I want to make lines, using any simple geometry (e.g: the '^' symbol). So what I want is something like these: (the line may not necessarily be horizontal or vertical):
^^^^^^^^^^^^^^^^^
*****************
Note: I don't want to make line with some characters. I want to do it using any arbitrary geometries (e.g: start shape, triangle, or any other geometry). In other word I want to repeat some geometries along a linear path between two points. So these simple geometries may be rotated somehow to follow the line and ...
I think this is an interesting problem but I can't fit a satisfying answer in the stackoverflow textbox so I uploaded a proposed solution on github:
https://github.com/mrange/CodeStack/tree/master/q14545675/LineGeometry
I don't claim this is 100% solution to your problem (for one I am not 100% of all your requirements) but if you take a look at it perhaps something can be worked and improved upon.
Unless ofc I am way wrong on what you are looking for.
If I understand correctly, you'd like to use the * or ^ or ! as a line essentially. Rather then use a normal solid, dash, dotted, and so on line you'd like to use physical characters? But you'd like those characters to become a Geometry object.
You could do something like:
// Create a line of characters.
string lineString = "^^^^^^^^^^^^^^";
// Create Formatted Text, customize accordingly.
FormattedText formatText = new FormattedText(
lineString, CultureInfo.GetCultureInfo("en-us"),
FlowDirection.LeftToRight,
new Typeface("Arial"), 32, Brushes.Black);
// Set the Width and Height.
formatText.MaxTextWidth = 200;
formatText.MaxTextHeight = 100;
// You can obviously add as many customization's and outputs of your choice.
Now I understand this isn't what you want, you want the above string to act in Geometry. To accomplish that; you just need to do:
// Build Geometry object to represent text.
Geometry lineGeometry = formatText.BuildGeometry(new System.Windows.Point(0, 0));
// Tailor Geometry object that represents our item.
Geometry hGeo = formatText.BuildHighlightGeometry(new System.Windows.Point(0, 0));
Now essentially you've built a Geometry object that represents "^^^^^^^^".
Hopefully I understood correctly, I don't know if that solves your problem.
I'm rendering text using FormattedText, but there does appear to be any way to perform per-char hit testing on the rendered output. It's read-only, so I basically only need selection, no editing.
I'd use RichTextBox or similar, but I need to output text based on control codes embed in the text itself, so they don't always nest, which makes building the right Inline elements very complex. I'm also a bit worried about performance with that solution; I have a large number of lines, and new lines are appended often.
I've looked at GlyphRun, it appears I could get hit-testing from it or a related class, but I'd be reimplementing a lot of functionality, and it seems like there should be a simpler way...
Does anyone know of a good way to implement this?
You can get the geometry of each character from a FormattedText object and use the bounds of each character to do your hit testing.
var geometry = (GeometryGroup)((GeometryGroup)text.BuildGeometry(new Point(0, 0))).Children[0];
foreach (var c in geometry.Children)
{
if (c.Bounds.Contains(point))
return index;
index++;
}
In OnRender you can render these geometry objects instead of the formatted text.
The best way is to design a good data structure for storing your text and which also considers hit-testing. One example could be to split the text into blocks (words, lines or paragraphs depending on what you need). Then each such block should have a bounding-box which should be recomputed in any formatting operations. Also consider caret positions in your design.
Once you have such facility it becomes very easy to do hit-testing, just use the bounding boxes. It will also help in subsequent operations like highlighting a particular portion of text.
Completely agree with Sesh - the easiest way you're going to get away with not re-implementing a whole load of FormattedText functionality is going to be by splitting up the individual items you want to hit-test into their own controls/inlines.
Consider using a TextBlock and adding each word as it's own Inline ( or ), then either bind to the inline's IsMouseDirectlyOver property, our add delegates to the MouseEnter & MouseLeave events.
If you want to do pixel-level hit testing of the actual glyphs (i.e. is the mouse exactly in the dot of this 'i'), then you'll need to use GlyphRuns and do manual hit testing on the glyphs (read: hard work).
I'm very late to the party--if the party is not over, and you don't need the actual character geometry, I found something like this useful:
for (int i = 0; i < FormattedText.Text.Length; i++)
{
characterHighlightGeometry = FormattedText.BuildHighlightGeometry(new Point(), i, 1);
CharacterHighlightGeometries.Children.Add(characterHighlightGeometry);
}
BuildGeometry() only includes the actual path geometry of a character. BuildHighlightGeometry() generates the outer bounds of all characters--including
spaces, so an index to a space can be located by:
foreach (var c in CharacterHighlightGeometries.Children)
{
if (c.Bounds.Contains(centerpoint))
{
q = c;
cpos = index;
break;
}
index++;
}
Hope this helps.