I have several PDF files. Using a Windows application (C#), I need to find out whether the PDF files contain overlapping text. How can I do this? Are there any free third-party DLLs that can help?
All I have so far are third-party DLLs that can extract the text and images from a PDF.
My PDFs are full of text and images. In some places, one line of text is printed on top of another line, or some text is printed on top of an image. These kinds of overlaps need to be found.
As you can see in the image, the overlaps might have occurred because bounding boxes overlap and also because glyph contours overlap. Both of these occurrences in the PDF need to be found. My PDF doesn't contain any annotations, so overlapping occurs only in the content of the PDF. We don't use the poor-man's-bold technique for fatter glyphs; if it does occur, it should be considered overlapping.
There are not going to be any transparent images in the PDF. The only image we might have is the logo or the digital signature at the bottom of the page; any text that overlaps it should be considered overlapping.
The PDFs are not created from images (scans). They have been created from a text editor.
The OP clarified in comments:
those overlapping might have occurred because of bounding boxes overlap and as well as glyphs contours overlap. So these two occurrences in the PDF needs to be found.
Whenever the glyph contours themselves overlap, their bounding boxes also overlap.
Thus, it suffices to check for overlapping bounding boxes.
only image we might have is the logo or the digital signature at the bottom of the page, any text overlaps this should be considered as overlapping.
Thus, for text overlapping images we do not need to check whether a blank area in the image is overlapped.
My PDF files doesnt have any annotations.
Thus, we only need to check the page contents (including contents of form xobjects referenced from the page content, allowing recursion).
Furthermore the OP only mentioned text and images. Thus, we can ignore vector graphics.
An approach using iText 7
As I'm more into Java, I first created a proof of concept in Java and ported it to .Net later.
Both for Java and .Net the line of action is the same:
We create an event listener for the iText 7 parsing framework which (while processing a page) collects the bounding boxes of text and image elements and can eventually be asked to check whether there are any occurrences of text overlapping text or an image.
We parse the content of the page in question using an instance of that event listener class and query it for overlaps. If more pages are to be checked, this can be done over and over again with a new event listener instance for each page.
iText 7 for .Net
The event listener might look like this:
class OverlappingTextSearchingStrategy : IEventListener
{
static List<Vector> UNIT_SQUARE_CORNERS = new List<Vector> { new Vector(0, 0, 1), new Vector(1, 0, 1), new Vector(1, 1, 1), new Vector(0, 1, 1) };
ICollection<Rectangle> imageRectangles = new HashSet<Rectangle>();
ICollection<Rectangle> textRectangles = new HashSet<Rectangle>();
public void EventOccurred(IEventData data, EventType type)
{
if (data is ImageRenderInfo) {
ImageRenderInfo imageData = (ImageRenderInfo)data;
Matrix ctm = imageData.GetImageCtm();
List<Rectangle> cornerRectangles = new List<Rectangle>(UNIT_SQUARE_CORNERS.Count);
foreach (Vector unitCorner in UNIT_SQUARE_CORNERS)
{
Vector corner = unitCorner.Cross(ctm);
cornerRectangles.Add(new Rectangle(corner.Get(Vector.I1), corner.Get(Vector.I2), 0, 0));
}
Rectangle boundingBox = Rectangle.GetCommonRectangle(cornerRectangles.ToArray());
Console.WriteLine("Adding image bounding rectangle {0}.", boundingBox);
imageRectangles.Add(boundingBox);
} else if (data is TextRenderInfo) {
TextRenderInfo textData = (TextRenderInfo)data;
Rectangle ascentRectangle = textData.GetAscentLine().GetBoundingRectangle();
Rectangle descentRectangle = textData.GetDescentLine().GetBoundingRectangle();
Rectangle boundingBox = Rectangle.GetCommonRectangle(ascentRectangle, descentRectangle);
if (boundingBox.GetHeight() == 0 || boundingBox.GetWidth() == 0)
Console.WriteLine("Ignoring empty text bounding rectangle {0} for \"{1}\".", boundingBox, textData.GetText());
else
{
Console.WriteLine("Adding text bounding rectangle {0} for \"{1}\" with 0.5 margins.", boundingBox, textData.GetText());
textRectangles.Add(boundingBox.ApplyMargins<Rectangle>(0.5f, 0.5f, 0.5f, 0.5f, false));
}
} else if (data is PathRenderInfo) {
// TODO
} else if (data != null)
{
Console.WriteLine("Ignored {0} event, class {1}.", type, data.GetType().Name);
}
else
{
Console.WriteLine("Ignored {0} event with null data.", type);
}
}
public ICollection<EventType> GetSupportedEvents()
{
// Support all events
return null;
}
public bool foundOverlappingText()
{
bool result = false;
List<Rectangle> textRectangleList = new List<Rectangle>(textRectangles);
while (textRectangleList.Count > 0)
{
Rectangle testRectangle = textRectangleList[textRectangleList.Count - 1];
textRectangleList.RemoveAt(textRectangleList.Count - 1);
foreach (Rectangle rectangle in textRectangleList)
{
if (intersect(testRectangle, rectangle))
{
Console.WriteLine("Found text intersecting text with bounding boxes {0} at {1},{2} and {3} at {4},{5}.",
testRectangle, testRectangle.GetX(), testRectangle.GetY(), rectangle, rectangle.GetX(), rectangle.GetY());
result = true;// if only the fact counts, do instead: return true
}
}
foreach (Rectangle rectangle in imageRectangles)
{
if (intersect(testRectangle, rectangle))
{
Console.WriteLine("Found text intersecting image with bounding boxes {0} at {1},{2} and {3} at {4},{5}.",
testRectangle, testRectangle.GetX(), testRectangle.GetY(), rectangle, rectangle.GetX(), rectangle.GetY());
result = true;// if only the fact counts, do instead: return true
}
}
}
return result;
}
bool intersect(Rectangle a, Rectangle b)
{
return intersect(a.GetLeft(), a.GetRight(), b.GetLeft(), b.GetRight()) &&
intersect(a.GetBottom(), a.GetTop(), b.GetBottom(), b.GetTop());
}
bool intersect(float start1, float end1, float start2, float end2)
{
if (start1 < start2)
return start2 <= end1;
else
return start1 <= end2;
}
}
This event listener can be used like this:
PdfReader reader = new PdfReader(pdf);
PdfDocument document = new PdfDocument(reader);
PdfDocumentContentParser contentParser = new PdfDocumentContentParser(document);
OverlappingTextSearchingStrategy strategy = contentParser.ProcessContent(page, new OverlappingTextSearchingStrategy());
bool foundOverlaps = strategy.foundOverlappingText();
iText 7 for Java
The event listener might look like this:
public class OverlappingTextSearchingStrategy implements IEventListener {
static List<Vector> UNIT_SQUARE_CORNERS = Arrays.asList(new Vector(0,0,1), new Vector(1,0,1), new Vector(1,1,1), new Vector(0,1,1));
Set<Rectangle> imageRectangles = new HashSet<>();
Set<Rectangle> textRectangles = new HashSet<>();
@Override
public void eventOccurred(IEventData data, EventType type) {
if (data instanceof ImageRenderInfo) {
ImageRenderInfo imageData = (ImageRenderInfo) data;
Matrix ctm = imageData.getImageCtm();
List<Rectangle> cornerRectangles = new ArrayList<>(UNIT_SQUARE_CORNERS.size());
for (Vector unitCorner : UNIT_SQUARE_CORNERS) {
Vector corner = unitCorner.cross(ctm);
cornerRectangles.add(new Rectangle(corner.get(Vector.I1), corner.get(Vector.I2), 0, 0));
}
Rectangle boundingBox = Rectangle.getCommonRectangle(cornerRectangles.toArray(new Rectangle[cornerRectangles.size()]));
logger.info(String.format("Adding image bounding rectangle %s.", boundingBox));
imageRectangles.add(boundingBox);
} else if (data instanceof TextRenderInfo) {
TextRenderInfo textData = (TextRenderInfo) data;
Rectangle ascentRectangle = textData.getAscentLine().getBoundingRectangle();
Rectangle descentRectangle = textData.getDescentLine().getBoundingRectangle();
Rectangle boundingBox = Rectangle.getCommonRectangle(ascentRectangle, descentRectangle);
if (boundingBox.getHeight() == 0 || boundingBox.getWidth() == 0)
logger.info(String.format("Ignoring empty text bounding rectangle %s for '%s'.", boundingBox, textData.getText()));
else {
logger.info(String.format("Adding text bounding rectangle %s for '%s' with 0.5 margins.", boundingBox, textData.getText()));
textRectangles.add(boundingBox.applyMargins(0.5f, 0.5f, 0.5f, 0.5f, false));
}
} else if (data instanceof PathRenderInfo) {
// TODO: vector graphics
} else if (data != null) {
logger.fine(String.format("Ignored %s event, class %s.", type, data.getClass().getSimpleName()));
} else {
logger.fine(String.format("Ignored %s event with null data.", type));
}
}
@Override
public Set<EventType> getSupportedEvents() {
// Support all events
return null;
}
public boolean foundOverlappingText() {
boolean result = false;
List<Rectangle> textRectangleList = new ArrayList<>(textRectangles);
while (!textRectangleList.isEmpty())
{
Rectangle testRectangle = textRectangleList.remove(textRectangleList.size() - 1);
for (Rectangle rectangle : textRectangleList) {
if (intersect(testRectangle, rectangle)) {
logger.info(String.format("Found text intersecting text with bounding boxes %s at %s,%s and %s at %s,%s.",
testRectangle, testRectangle.getX(), testRectangle.getY(), rectangle, rectangle.getX(), rectangle.getY()));
result = true;// if only the fact counts, do instead: return true
}
}
for (Rectangle rectangle : imageRectangles) {
if (intersect(testRectangle, rectangle)) {
logger.info(String.format("Found text intersecting image with bounding boxes %s at %s,%s and %s at %s,%s.",
testRectangle, testRectangle.getX(), testRectangle.getY(), rectangle, rectangle.getX(), rectangle.getY()));
result = true;// if only the fact counts, do instead: return true
}
}
}
return result;
}
boolean intersect(Rectangle a, Rectangle b) {
return intersect(a.getLeft(), a.getRight(), b.getLeft(), b.getRight()) &&
intersect(a.getBottom(), a.getTop(), b.getBottom(), b.getTop());
}
boolean intersect(float start1, float end1, float start2, float end2) {
if (start1 < start2)
return start2 <= end1;
else
return start1 <= end2;
}
Logger logger = Logger.getLogger(OverlappingTextSearchingStrategy.class.getName());
}
This event listener can be used like this:
PdfReader reader = new PdfReader(pdf);
PdfDocument document = new PdfDocument(reader);
PdfDocumentContentParser contentParser = new PdfDocumentContentParser(document);
OverlappingTextSearchingStrategy strategy = contentParser.processContent(pageNumber, new OverlappingTextSearchingStrategy());
boolean foundOverlaps = strategy.foundOverlappingText();
Remarks
As you can see I don't store the text bounding boxes as they are but instead
boundingBox.applyMargins(0.5f, 0.5f, 0.5f, 0.5f, false),
i.e. slightly smaller boxes. This is done to prevent false positives which otherwise might occur for tightly set text or text with kerning applied. You may have to fine tune the margin values here.
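To illustrate why the margins matter, here is a minimal, library-free sketch. The `Box` class and its `shrink` method are hypothetical stand-ins for iText's `Rectangle` and `applyMargins(..., false)`, and the coordinates are made up. It shows two text chunks set directly next to each other: with the inclusive comparison used in `intersect` they count as overlapping, while the shrunken boxes do not.

```java
public class MarginDemo {
    /** Axis-aligned box; a hypothetical stand-in for iText's Rectangle. */
    static final class Box {
        final float left, bottom, right, top;

        Box(float left, float bottom, float right, float top) {
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }

        /** Returns a copy that is smaller by the given margin on every side. */
        Box shrink(float margin) {
            return new Box(left + margin, bottom + margin, right - margin, top - margin);
        }
    }

    /** Inclusive interval test, same logic as intersect(float, float, float, float) above. */
    static boolean intersect(float start1, float end1, float start2, float end2) {
        return start1 < start2 ? start2 <= end1 : start1 <= end2;
    }

    /** Two boxes intersect if their x ranges and their y ranges both intersect. */
    static boolean intersect(Box a, Box b) {
        return intersect(a.left, a.right, b.left, b.right)
                && intersect(a.bottom, a.top, b.bottom, b.top);
    }

    public static void main(String[] args) {
        Box first = new Box(10, 100, 50, 112);
        Box second = new Box(50, 100, 90, 112); // starts exactly where 'first' ends

        // Touching boxes are reported as overlapping by the inclusive test ...
        System.out.println(intersect(first, second));
        // ... but not once both are shrunk by the 0.5 margin.
        System.out.println(intersect(first.shrink(0.5f), second.shrink(0.5f)));
    }
}
```

Larger margins suppress more false positives from tight typesetting but also hide genuine overlaps smaller than twice the margin, which is why the value may need tuning per document set.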
It may be as easy as the example above, or you may have to implement your own reader for this.
If you do not have full control over your PDF files, you have no chance of solving your problem. The defined boxes can be transformed later on, so you have to parse the whole file to keep track of each box's position and shape. Additionally, some boxes may lie on top of other boxes but render without any collision at the pixel level.
Then you will run into the next problem: each PDF implementation has different errors. Your system may render the text perfectly while your customer's printer does not.
Welcome to hell ;)
Every support person will tell you that they obey the standard and that the others must have implemented their PDF library incorrectly. Because your customers' data is confidential, you cannot prove them wrong. You may find some errors with your test data, but never the same errors as in your customers' documents.
Run and hide for as long as you have not yet become the PDF expert of your company.
Here is a dirty "general" method: render the page without your text into one bitmap, render the page with your text into another bitmap, and compare the area covered by your text. This needs a monochrome background, and the load will be really high. However, this document looks like a form. Create a form and fill out the form fields; then you will have no problems, and you will get correct results even if another program fills in the form.
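One possible reading of the bitmap comparison is sketched below, under stated assumptions: the page has already been rendered without the text by some PDF renderer (not shown here), the background is a single known colour, and the area the text will occupy is known in pixel coordinates. The class and method names are hypothetical. The check then reduces to asking whether that area is blank before the text is drawn.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

public class TextAreaCheck {
    /**
     * Returns true if every pixel of textArea in the rendering without text
     * has the background colour, i.e. nothing lies underneath the text.
     * backgroundRgb is an ARGB value as returned by BufferedImage.getRGB.
     */
    static boolean areaIsBlank(BufferedImage pageWithoutText, Rectangle textArea, int backgroundRgb) {
        // Clip the area to the bitmap so that getRGB never leaves the image.
        Rectangle clipped = textArea.intersection(
                new Rectangle(0, 0, pageWithoutText.getWidth(), pageWithoutText.getHeight()));
        for (int y = clipped.y; y < clipped.y + clipped.height; y++) {
            for (int x = clipped.x; x < clipped.x + clipped.width; x++) {
                if (pageWithoutText.getRGB(x, y) != backgroundRgb) {
                    return false; // something other than background is already here
                }
            }
        }
        return true;
    }
}
```

This only detects text on top of other page content; text overlapping other text would need one rendering per text chunk, which is where the high load mentioned above comes from.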
Hello, I have a code sample that uses a non-free library, but I think other libraries should have similar functionality, so you may use it as an idea:
Before using the following code sample, please ensure that you use the latest version of the Apitron PDF Kit.
using System;
using System.Collections.Generic;
using System.IO;
using Apitron.PDF.Kit.FixedLayout;
using Apitron.PDF.Kit.FixedLayout.Content;
using Apitron.PDF.Kit.FixedLayout.PageProperties;
using FixedLayout.Resources;
using FixedLayout.ContentElements;
/// <summary>
/// Gets all text boundaries.
/// </summary>
/// <param name="elements">The elements.</param>
/// <param name="boundaries">The boundaries.</param>
/// <param name="offset">The offset added to each boundary; may be null.</param>
public void GetAllTextBoundaries(IContentElementsEnumerator elements, IList<Boundary> boundaries, Boundary offset)
{
// We don't count drawings and images here, only text.
if(elements == null)
{
return;
}
foreach (IContentElement element in elements)
{
TextContentElement text = element as TextContentElement;
if (text != null)
{
foreach (TextSegment segment in text.Segments)
{
Boundary currentBoundary = segment.Boundary;
if (offset != null)
{
currentBoundary = new Boundary(currentBoundary.Left + offset.Left, currentBoundary.Bottom + offset.Bottom, currentBoundary.Right + offset.Left, currentBoundary.Top + offset.Bottom);
}
boundaries.Add(currentBoundary);
}
}
else if (element is FormContentElement)
{
Boundary currentBoundary = (element as FormContentElement).Boundary;
if (offset != null)
{
currentBoundary = new Boundary(currentBoundary.Left + offset.Left, currentBoundary.Bottom + offset.Bottom, currentBoundary.Right + offset.Left, currentBoundary.Top + offset.Bottom);
}
this.GetAllTextBoundaries((element as FormContentElement).FormXObject.Elements, boundaries, currentBoundary);
}
}
}
/// <summary>
/// Checks if text is overlapped.
/// </summary>
/// <returns></returns>
public bool CheckIfTextIsOverlapped(string fileName)
{
const double overlapMax = 5;
using (System.IO.Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.ReadWrite))
{
using (FixedDocument document = new FixedDocument(stream))
{
foreach (Page page in document.Pages)
{
IList<Boundary> boundaries = new List<Boundary>();
foreach (Annotation annotation in page.Annotations)
{
// Actually we need only Normal state, but will check all - to be sure.
if(annotation.Appearance.Normal != null)
{
this.GetAllTextBoundaries(annotation.Appearance.Normal.Elements, boundaries, annotation.Boundary);
}
}
IContentElementsEnumerator elements = page.Elements;
this.GetAllTextBoundaries(elements, boundaries, null);
for (int i = 0; i < boundaries.Count; i++)
{
for (int j = i + 1; j < boundaries.Count; j++)
{
Boundary b1 = boundaries[i];
Boundary b2 = boundaries[j];
double x1 = Math.Max(b1.Left, b2.Left);
double y1 = Math.Max(b1.Bottom, b2.Bottom);
double x2 = Math.Min(b1.Right, b2.Right);
double y2 = Math.Min(b1.Top, b2.Top);
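// So we have an intersection
if (x1 < x2 && y1 < y2)
{
// x2 - x1 and y2 - y1 are the width and height of the intersection;
// the original x1 - x2 and y1 - y2 are always negative here, so the check could never succeed
if (x2 - x1 >= overlapMax || y2 - y1 >= overlapMax)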
{
return true;
}
}
}
}
}
}
}
return false;
}
I really need some help here. I'm trying to create a program similar to the game known as "connect the dots", where you have dots numbered from 1 to n+1 and you need to connect them with lines.
So I have a panel, and I read the coordinates of the dots from a file. But I'm stuck because I can't figure out how to connect the dots with lines.
My current outcome
To sum up what I want to do:
You click dot 1, then you click dot 2, and they are connected with a line; otherwise they don't connect.
And you need to connect the dots in order from 1 to n+1.
I hope you understand what I mean. Thanks a lot in advance!
private void panel1_Paint(object sender, PaintEventArgs e)
{
List<String> pav1;
pav1 = new List<String>();
StreamReader datafile = new StreamReader("pav1.txt");
int[] X = new int[100];
int[] Y = new int[100];
int k = 0;
string line;
while (datafile.Peek() >= 0)
{
line = datafile.ReadLine();
X[k] = Int16.Parse(line);
line = datafile.ReadLine();
Y[k] = Int16.Parse(line);
k++;
}
datafile.Close();
Brush aBrush = (Brush)Brushes.Black;
for (int i = 0; i < k; i++)
{
e.Graphics.FillEllipse(aBrush, X[i], Y[i], 10, 10);
e.Graphics.DrawString((i + 1).ToString(), new Font("Arial", 10),
System.Drawing.Brushes.Gray, new Point(X[i] + 20, Y[i]));
}
}
First of all, take the points out of the panel1_Paint method and add an additional property such as an ordinal. So, instead of the arrays X[] and Y[], you should create a class like this:
public class Dot
{
public Point Coordinates { get; set; }
public int Ordinal { get; set; }
}
and then
List<Dot> Dots { get; set; }
Add two properties for the first and second selected dots:
private Dot FirstDot { get; set; }
private Dot SecondDot { get; set; }
Fill that list same way you're filling X[] and Y[] arrays.
Then add OnMouseClick handler on your panel and in it write something like this:
private void panel1_MouseClick(object sender, MouseEventArgs e)
{
//check if user clicked on any of dots in list
var selectedDot = Dots.FirstOrDefault(dot => e.X < dot.Coordinates.X + 10 && e.X > dot.Coordinates.X
&& e.Y < dot.Coordinates.Y + 10 && e.Y > dot.Coordinates.Y);
//dot is found, add it to selected first or second dot property
if (selectedDot != null)
{
if (FirstDot == null)
FirstDot = selectedDot;
else if (SecondDot == null)
SecondDot = selectedDot;
//request a repaint so the paint handler can draw the line
panel1.Invalidate();
}
}
Now, in your paint method, you should check whether both dots are set and, if they are, whether one directly follows the other, something like
if (FirstDot != null && SecondDot != null && FirstDot.Ordinal + 1 == SecondDot.Ordinal)
then you can draw the line using (note that DrawLine expects a Pen, not a Brush)
e.Graphics.DrawLine(Pens.Black, FirstDot.Coordinates, SecondDot.Coordinates);
That should be it, apart from a few checks. I hope this makes clear how to implement it.
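The selection logic described above can be sketched independently of any UI toolkit. The following is a minimal sketch in Java; the logic ports directly to the C# click handler. The class name `DotConnector`, the `lines` list, and the rule that the dot just reached becomes the start of the next segment are assumptions made for illustration, not part of the answer above.

```java
import java.awt.Point;
import java.util.ArrayList;
import java.util.List;

public class DotConnector {
    static final int DOT_SIZE = 10; // the question draws each dot as a 10x10 ellipse

    static final class Dot {
        final Point coordinates; // upper-left corner of the dot's square
        final int ordinal;

        Dot(int x, int y, int ordinal) {
            this.coordinates = new Point(x, y);
            this.ordinal = ordinal;
        }
    }

    private final List<Dot> dots = new ArrayList<>();
    private final List<Dot[]> lines = new ArrayList<>(); // pairs of connected dots
    private Dot firstDot;

    /** Adds a dot; ordinals are assigned in insertion order starting at 1. */
    void addDot(int x, int y) {
        dots.add(new Dot(x, y, dots.size() + 1));
    }

    /** Returns the dot whose square contains the point, or null if there is none. */
    Dot hitTest(int x, int y) {
        for (Dot dot : dots) {
            if (x > dot.coordinates.x && x < dot.coordinates.x + DOT_SIZE
                    && y > dot.coordinates.y && y < dot.coordinates.y + DOT_SIZE) {
                return dot;
            }
        }
        return null;
    }

    /** Handles a click; returns true if the click completed a new line. */
    boolean click(int x, int y) {
        Dot hit = hitTest(x, y);
        if (hit == null) {
            return false; // clicked on empty space
        }
        if (firstDot == null) {
            firstDot = hit; // first dot of a pair
            return false;
        }
        boolean connected = firstDot.ordinal + 1 == hit.ordinal;
        if (connected) {
            lines.add(new Dot[] { firstDot, hit });
        }
        // Continue from the dot just reached; after a wrong pick, start over.
        firstDot = connected ? hit : null;
        return connected;
    }

    int lineCount() {
        return lines.size();
    }
}
```

Keeping the connected pairs in a list means the paint handler can redraw every line on each repaint instead of only the most recent one.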
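Use the Graphics.DrawLine() method; I don't know why you're using ellipse drawing. Your loop should look something like
var myFont = new Font("Arial", 10);
// Stop while a second point is still available, so X[i + 1] and Y[i + 1] always hold values read from the file
for (int i = 0; i + 1 < k; i += 2)
{
var point1 = new Point(X[i], Y[i]);
var point2 = new Point(X[i + 1], Y[i + 1]);
// DrawLine expects a Pen, not a Brush
e.Graphics.DrawLine(Pens.Black, point1, point2);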
e.Graphics.DrawString((i + 1).ToString(), myFont, System.Drawing.Brushes.Gray, point1);
e.Graphics.DrawString((i + 2).ToString(), myFont, System.Drawing.Brushes.Gray, point2);
}
Also, point 0, 0 is the upper left corner.
Say I have a given TextRange range that happens to have this text in it ----------------- (On its own line.)
I want to draw a real line whenever I see that text (instead of just 15 dashes).
But, I need to leave the dashes there for when I save it (and when other, plain text viewers load it).
I found how I can draw a line in the RichTextBox:
var line = new Line {X1 = 10, X2 = 200, Y1 = 5, Y2 = 5,
var paragraph = (Paragraph) MyRichTextBox.Document.Blocks.FirstBlock;
paragraph.Inlines.Add(line);
But this just draws the line after the last Inline in the paragraph.
So, my question is:
How can I draw so that my UIElement does not have text wrapping on (so that I can cover the dashes)?
Is this possible with the WPF RichTextBox?
Could you use TextDecorations.Strikethrough for this?
TextRange range = new TextRange(RichTextBox.Selection.Start, RichTextBox.Selection.End);
TextDecorationCollection tdc = (TextDecorationCollection)RichTextBox.Selection.GetPropertyValue(Inline.TextDecorationsProperty);
if (!tdc.Equals(TextDecorations.Strikethrough))
{
tdc = TextDecorations.Strikethrough;
}
range.ApplyPropertyValue(Inline.TextDecorationsProperty, tdc);
I think you have to remove the Inline from within the paragraph and replace it with a Line element. In this case, you will have to replace Line elements with "-----" on save.
private void FindHRules()
{
foreach (Paragraph block in rtf.Document.Blocks.OfType<Paragraph>())
{
var inlines = block.Inlines.ToList();
for(int i = 0; i<inlines.Count; i++)
{
var inline = inlines[i];
TextRange r = new TextRange(inline.ContentStart, inline.ContentEnd);
if (r.Text.StartsWith("--"))
{
Line l = new Line { Stretch = Stretch.Fill, Stroke = Brushes.DarkBlue, X2 = 1 };
block.Inlines.InsertAfter(inline, new InlineUIContainer(l));
block.Inlines.Remove(inline);
}
}
}
}
I tested this with an RTF doc that had the "-----" lines in stand-alone paragraphs (<enter>) and after line breaks (<shift-enter>) within other paragraphs.
I'm using C# WinForms and GDI+ to do something I hoped wouldn't be too much problem but...
I'm basically trying to draw a string within a rectangle that has highlighted sections within the string. This all works fine when printing on one line, but I have issues when trying to wrap the text onto the next line within the rectangle.
The algorithm used is as follows: -
Split strings into a collection of highlight and not highlight.
Do
If Highlightedtext Then
DrawString(HighLightedText);
Move X position forward to next character space
Else
DrawString(NormalText);
Move X position forward to next character space
End If
Loop
I would put the code in, but it's messy and long (I'm maintaining it). It prints out fine if the text is a single string that is either entirely highlighted or entirely plain, as it wraps within the bounds of the rectangle without issue if it's too long. If there are multiple highlighted sections and the string is bigger than the rectangle, it writes outside of it. This is because the "move X position forward..." step just moves the rectangle along, which is a problem!
I essentially want to move the point at which the text is printed within the original rectangle, and print on the next line if wrapping is required. Can anyone assist with this? It's a real pain!
I've managed to sort this out by making my function draw one character at a time.
To do this, I made a function that returns an array of boolean values (the same length as the string itself) in which every highlighted character is set to true.
private bool[] Get_CharacterArray(string text)
{
// Declare the length of the array, all set to false
bool[] characters = new bool[text.Length];
// Get the matching points
List<Point> wordLocs = FindMatchingTerms(text);
wordLocs.Sort(PointComparison);
int position = 0;
foreach (Point loc in wordLocs)
{
// We're only setting the array for matched points
for (position = loc.X; position <= loc.Y; position++)
{
characters[position] = true;
}
}
// Return the array
return characters;
}
(FindMatchingTerms() is a function that looks in the string and returns the matches found as a collection.)
I then loop over this array to draw the text to the screen while keeping track of the rectangle's width. When the drawing position reaches the right edge, I reset it back to the start of the line and move the starting Y position down a bit.
private void RenderFormattedText(Graphics g, RectangleF bounds, string text, string matchText, Font font, Color colour, bool alignTextToTop)
{
const string spaceCharacter = " ";
const string hyphenCharacter = "-";
Font fr = null;
Font fi = null;
try
{
// Get the matching characters.
bool[] charactersMatched = Get_CharacterArray(text);
// Setup the fonts and bounds.
fr = new Font(font.FontFamily, font.Size, FontStyle.Regular);
fi = new Font(font.FontFamily, font.Size, FontStyle.Bold | FontStyle.Underline);
SizeF fontSize = g.MeasureString(text, fi, 0, StringFormat.GenericTypographic);
RectangleF area = bounds;
// Loop all the characters of the phrase
for (int pos = 0; pos < charactersMatched.Length; pos++)
{
// Draw the character in the appropriate style.
string output = text.Substring(pos, 1);
if (charactersMatched[pos])
{
area.X += DrawFormattedText(g, area, output, fi, colour);
}
else
{
area.X += DrawFormattedText(g, area, output, fr, colour);
}
// Are we towards the end of the line?
if (area.X > (bounds.X + bounds.Width - 1))
{
// are we in the middle of a word?
string preOutput = spaceCharacter;
string postOutput = spaceCharacter;
// Get at the previous character and after character (when they exist)
if (pos > 0)
{
preOutput = text.Substring(pos - 1, 1);
}
if ((pos + 1) < text.Length)
{
postOutput = text.Substring(pos + 1, 1);
}
// Are we in the middle of a word? if so, hyphen it!
if (!preOutput.Equals(spaceCharacter) && !postOutput.Equals(spaceCharacter))
{
if (charactersMatched[pos])
{
area.X += DrawFormattedText(g, area, hyphenCharacter, fi, colour);
}
else
{
area.X += DrawFormattedText(g, area, hyphenCharacter, fr, colour);
}
}
}
// Are we at the end of the line?
if (area.X > (bounds.X + bounds.Width))
{
area.X = bounds.X;
area.Y += fontSize.Height + 2;
}
}
}
finally
{
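// The fonts are still null if an exception occurred before they were created
if (fr != null) fr.Dispose();
if (fi != null) fi.Dispose();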
}
}
Hopefully someone else will find this useful :) I've got some constants in there for spaceCharacter and hyphenCharacter, which should be self-explanatory! There are custom functions to draw the string, but it should make sense nonetheless.
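The wrapping step on its own can be sketched without any drawing code. This is a minimal sketch in Java, assuming the width of each character has already been measured (in the answer above that is what the custom DrawFormattedText function returns); the class and method names are hypothetical. Unlike the code above, it wraps before a character would cross the right edge rather than after, and it leaves out the hyphenation step.

```java
public class WrapSketch {
    /**
     * Computes the drawing position of each character.
     * charWidths[i] is the measured width of character i; the result holds
     * an {x, y} pair per character, relative to the same origin as left/top.
     */
    static float[][] layout(float[] charWidths, float left, float top,
                            float width, float lineHeight) {
        float[][] positions = new float[charWidths.length][];
        float x = left;
        float y = top;
        for (int i = 0; i < charWidths.length; i++) {
            // Wrap if this character would cross the right edge, but never
            // at the start of a line (a too-wide character would loop forever).
            if (x > left && x + charWidths[i] > left + width) {
                x = left;
                y += lineHeight;
            }
            positions[i] = new float[] { x, y };
            x += charWidths[i];
        }
        return positions;
    }

    public static void main(String[] args) {
        float[] widths = { 10, 10, 10, 10, 10 };
        for (float[] p : layout(widths, 0, 0, 25, 12)) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```

Because the highlight flag only selects the font used to measure and draw each character, the same loop works whether or not a highlighted run spans a line break.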