I'm working on a text extraction system for PDF files using iTextSharp. I have already created a class that implements ITextExtractionStrategy and implemented methods like RenderText(), GetResultantText(), etc. I have also studied the LocationTextExtractionStrategy class provided by iTextSharp itself.
The problem I'm facing is that for a particular PDF document, the RenderText() method reports the horizontal position of a few text chunks incorrectly. This happens for around 15-20 chunks out of the 700+ text chunks on the page. I'm using the following simple code to get the text position in RenderText():
// Record each chunk's text, baseline start/end points and single-space width.
LineSegment segment = renderInfo.GetBaseline();
TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
chunks.Add(location);
After collecting all the text chunks, I draw them on a bitmap using the Graphics class and the following simple loop:
for (int k = 0; k < chunks.Count; k++)
{
    var ch = chunks[k];
    // PDF's origin is the bottom-left corner while GDI+'s is the top-left, hence the Y flip.
    g.DrawString(ch.text, fnt, Brushes.Black, ch.startLocation[Vector.I1], bmp.Height - ch.startLocation[Vector.I2], StringFormat.GenericTypographic);
}
The problem occurs in the X (horizontal) dimension only, and only for these few text chunks: they appear slightly to the left of their actual position. Was wondering if there's something wrong with my code here.
Shujaat
Finally figured this out. In PDF, computing actual text positions is more complicated than simply reading the baseline coordinates. You need to incorporate character and word spacing, horizontal and vertical scaling, and some other factors too. I did some correspondence with the iText people, and they have now incorporated a new method in the TextRenderInfo class that provides actual character-by-character positions, taking care of all of the above factors.
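That method is GetCharacterRenderInfos() (added in iTextSharp 5.3.3). As a minimal sketch, RenderText() could collect per-glyph positions like this, reusing the chunks list and TextChunk class from my question:

public virtual void RenderText(TextRenderInfo renderInfo)
{
    // Each returned TextRenderInfo covers a single glyph, with its position already
    // adjusted for character/word spacing, scaling and the other factors.
    foreach (TextRenderInfo glyph in renderInfo.GetCharacterRenderInfos())
    {
        LineSegment seg = glyph.GetBaseline();
        chunks.Add(new TextChunk(glyph.GetText(), seg.GetStartPoint(), seg.GetEndPoint(), glyph.GetSingleSpaceWidth()));
    }
}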
I'm trying to build an OCR/OCV app. It works well, but in real-world scenarios printed text is not perfect; it has defects like ink spread or a cut through a character. Ink spread is manageable, but I'm stuck on how to join the two parts of a character when there is a cut, like in the image below:
I find contours before I do OCR/OCV:
using (VectorOfVectorOfPoint contours = new VectorOfVectorOfPoint())
{
    // Find the outer contours in the binarized image.
    CvInvoke.FindContours(binaryimg, contours, null, RetrType.External, ChainApproxMethod.ChainApproxSimple);
    int count = contours.Size;
    for (int i = 0; i < count; i++)
    {
        // Approximate each contour and take its bounding rectangle as a character ROI.
        double perimeter = CvInvoke.ArcLength(contours[i], true);
        VectorOfPoint approx = new VectorOfPoint();
        CvInvoke.ApproxPolyDP(contours[i], approx, 0.04 * perimeter, true);
        CvInvoke.DrawContours(mainimg, contours, i, new MCvScalar(0, 255, 0), 2);
        Rectangle r = CvInvoke.BoundingRectangle(approx);
        id++; // running character id, declared elsewhere
        int area = r.Width * r.Height;
        int width = r.Width;
        int height = r.Height;
    }
}
I get the height and width of each character, and inside those rectangles I do OCR and OCV. When there is a cut in a character, it gets detected as 2 contours. How do I join those? I tried morphological opening and closing but it didn't help much.
I'm not very familiar with OCR, but my understanding is that the first step is usually to run a text detector to find the parts of the image covered by text, forming regions of interest to feed to the OCR algorithm. Creating per-character ROIs seems like it could be problematic, since some fonts may have very small or even no spacing between some characters. Some OCR engines may also benefit from knowing the font.
Assuming the print defects cover more than one character, one option would be to detect and repair that specific defect. You could, for example, apply a horizontal blur filter, do some thresholding and line detection (see the sketch below). Once you have found a defect, you could try to repair it by finding places where it seems to cut characters, and fill those in.
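A rough sketch of the blur-and-threshold part, assuming Emgu CV as in your question (binaryimg is your binarized image, white glyphs on black; the 9x3 kernel is a guess you would tune):

using Emgu.CV;
using Emgu.CV.CvEnum;
using System.Drawing;

Mat smeared = new Mat();
// Blur mostly horizontally so thin cuts get smeared over by neighboring ink.
CvInvoke.Blur(binaryimg, smeared, new Size(9, 3), new Point(-1, -1));
Mat mask = new Mat();
// Re-binarize: pixels close to a glyph survive, bridging small gaps.
CvInvoke.Threshold(smeared, mask, 40, 255, ThresholdType.Binary);
// Find contours on mask to get whole-character ROIs, but run the actual
// OCR/OCV on the original, unblurred pixels inside each ROI.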
Another approach might be to retrain the neural network on your particular dataset and defects to try to improve accuracy.
But it is very likely that some errors will still occur, so in the end you might still need a human proof-reading the result, or some system to inform the operator about sections the algorithm is uncertain about.
I have a PDF file with a tabular structure, but I am not able to store it in a database because the PDF file uses the Mangal font.
So I face two problems:
Extracting the table data from the PDF
The text is in the Marathi language
I have managed to do this for English with the following code:
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, i + 1, strategy);
text.Append(currentText);
// Note: converting UTF-8 to UTF-8 is a no-op; this just decodes the raw page-content bytes.
string rawPdfContent = Encoding.UTF8.GetString(Encoding.Convert(Encoding.UTF8, Encoding.UTF8, pdfReader.GetPageContent(i + 1)));
This gives me the tabular structure, but only for English text; I want to know how to do the same for Marathi.
Funnily enough, requirement no. 1 is actually the hardest.
In order to understand why, you need to understand PDF a bit.
PDF is not a WYSIWYG format. If you open a PDF file in Notepad (or Notepad++), you'll see that it doesn't seem to contain any human-readable information.
In fact, PDF contains instructions that tell a viewer program (like Adobe Reader) how to render the document.
So instead of having an actual table in there (like you might expect in an HTML document), it will contain stuff like:
draw a line from .. to ..
go to position ..
draw the characters '123'
set the font to Helvetica bold
go to position ..
draw a line from .. to ..
draw the characters '456'
etc
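In raw content-stream syntax, those instructions look roughly like this (a schematic fragment for illustration, not taken from a real file):

BT                       % begin a text object
/F1 12 Tf                % set the font (the resource named F1) at size 12
100 700 Td               % go to position (100, 700)
(123) Tj                 % draw the characters '123'
ET                       % end the text object
100 650 m 300 650 l S    % draw a line from (100, 650) to (300, 650)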
See also How does TextRenderInfo work in iTextSharp?
In order to extract the table from the PDF, you need to do several things.
implement IEventListener (this is a class that you can attach to a Parser instance; the Parser will go over the entire page and notify all listeners of events like TextRenderInfo, ImageRenderInfo and PathRenderInfo; see the sketch after this list)
watch out for PathRenderInfo events
build a datastructure that tracks which paths are being drawn
as soon as you detect a cluster of lines at roughly 90° angles to each other, you can assume a table is being drawn
determine the biggest bounding box that fits the cluster of lines (this is known as the convex hull problem, and one algorithm to solve it is the gift wrapping algorithm)
now you have a rectangle that tells you where (on the page) the table is located.
you can now recursively apply the same logic within the table to determine rows and columns
you can also keep track of TextRenderInfo events, and sort them into bins depending on the rectangles that fit each individual cell of the table
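As a minimal sketch of the listener part, assuming iText 7 for .NET (where the parser types live in the iText.Kernel.Pdf.Canvas.Parser namespaces), this merely collects the path and text events that the later steps would analyze:

using System.Collections.Generic;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

class TableEventListener : IEventListener
{
    public readonly List<PathRenderInfo> Paths = new List<PathRenderInfo>();
    public readonly List<TextRenderInfo> Texts = new List<TextRenderInfo>();

    public void EventOccurred(IEventData data, EventType type)
    {
        if (type == EventType.RENDER_PATH)
            Paths.Add((PathRenderInfo)data);   // drawn lines and rectangles
        else if (type == EventType.RENDER_TEXT)
            Texts.Add((TextRenderInfo)data);   // text chunks with positions
    }

    // Returning null means the listener is notified of all event types.
    public ICollection<EventType> GetSupportedEvents() => null;
}

// Usage: new PdfCanvasProcessor(listener).ProcessPageContent(pdfDocument.GetPage(1));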
This is a lot of work. None of this is trivial. In fact, this is the kind of stuff people write PhD theses about.
iText has a good implementation of most of these algorithms in the form of the pdf2Data tool.
Code:
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, i+1, strategy);
string rawPdfContent = Encoding.UTF8.GetString(Encoding.Convert(Encoding.UTF8, Encoding.UTF8, pdfReader.GetPageContent(i + 1)));
Then I identified the horizontal and vertical lines in the PDF. For lines, the PDF content stream uses either the re (rectangle) operator or the m (moveto) and l (lineto) operators.
Then I worked on the Marathi text, which I extracted with iTextSharp.
Then I merged both; for the desired region I extract the text using this code:
// Compute the cell rectangle from two of the detected vertical lines.
Int64 width = Convert.ToInt64(linesVertical[5].StartPoint.X) - Convert.ToInt64(linesVertical[2].StartPoint.X);
Int64 height = Convert.ToInt64(linesVertical[2].EndPoint.Y) - Convert.ToInt64(linesVertical[2].StartPoint.Y);
// The Y offset (800 ... + 150) converts between coordinate origins; these constants are specific to this document.
System.util.RectangleJ rect = new System.util.RectangleJ(Convert.ToInt64(linesVertical[2].StartPoint.X), (800 - Convert.ToInt64(linesVertical[2].EndPoint.Y) + 150), width, height);
// Restrict text extraction to that region.
RenderFilter[] renderFilter = new RenderFilter[1];
renderFilter[0] = new RegionTextRenderFilter(rect);
ITextExtractionStrategy textExtractionStrategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
Owner_Name = PdfTextExtractor.GetTextFromPage(reader, 1, textExtractionStrategy);
I'm drawing a path on a PDF page using PDFsharp. I add all points to an XGraphicsPath and then draw the path on an XGraphics. However, the XGraphicsPath always closes my path (it always connects the end point to the begin point). Is it possible to not connect the end points so that I have an "open" path? I couldn't find this functionality in the documentation of PDFsharp.
Thanks in advance!
I haven't tried it myself, but it seems you have to do nothing special to get an open path:
http://pdfsharp.net/wiki/Graphics-sample.ashx#Stroke_an_open_path_12
You do not show any code, so no one else can try your code.
I know you asked this 3 years ago, but I have been using GDI+ and PDFsharp lately and I had this exact problem, so maybe I can help someone.
I had a loop drawing a bunch of polylines in GDI+. I decided to optimize this by changing a loop of DrawLines calls on every paint of my control into a one-time loop of GraphicsPath.AddLines(PointF[]) calls.
As soon as I made this change, DrawPath drew all my lines (which used to be separate) as one long continuous path from start to end.
I added a GraphicsPath.StartFigure() call before each AddLines, and it broke my polylines apart and drew them the way I intended.
GraphicsPath myShapes = new GraphicsPath();
for (int i = 0; i < PLineCount; i++)
{
    PointF[] points = new PointF[PLinePointCount];
    for (int ii = 0; ii < PLinePointCount; ii++)
    {
        points[ii] = new PointF(X, Y); // supply each point's coordinates here
    }
    myShapes.StartFigure(); // This is what I added to break the line segments apart.
    myShapes.AddLines(points);
}
This code won't run as written; you have to add code to provide the number of points in each polyline and the X,Y coordinates of each point. Whether you have predefined shapes or want to generate them randomly is up to you.
I have verified that XGraphicsPath has StartFigure just like GDI+'s GraphicsPath, so I think it likely this will solve your issue.
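For the PDFsharp side, here is a minimal sketch of two separate open polylines (assuming an XGraphics gfx obtained via XGraphics.FromPdfPage; stroking with a pen leaves the figures open, whereas filling would close them):

using PdfSharp.Drawing;

XGraphicsPath path = new XGraphicsPath();
path.AddLines(new XPoint[] { new XPoint(50, 50), new XPoint(150, 80), new XPoint(250, 50) });
path.StartFigure(); // begin a new, unconnected subpath
path.AddLines(new XPoint[] { new XPoint(50, 120), new XPoint(250, 120) });
gfx.DrawPath(XPens.Black, path); // stroke only; no closing segment is added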
Here are images of the undesirable version and of the one fixed by adding StartFigure before each shape.
GraphicsPath screenshots:
I am using iTextSharp in a C# Windows app to manipulate scanned portrait PDF invoice files. After scanning the files, I'd like to automatically check (estimate) the orientation of the text on the page (the user may have scanned upside down).
Invoices come from a variety of suppliers, so I can't search for standard text or an image.
I was thinking that if I could crop the PDF page in two (top and bottom) and create two new PDF files, I could then compare the two file sizes. The largest file would probably be the top of the page. I could then rotate the page if required (I know how to do this bit).
Thanks
Update: I have found a way to split the page in half, but unfortunately the two files that are created are the same size (even though there are more text and images in the top half):
private void TrimDocument()
{
    // derived from http://www.namedquery.com/cropping-pdf-using-itextsharp
    PdfReader pdfReader = new PdfReader("C:/Docman/RawScans/PDFWeightedTop.pdf");
    // Top half of the page:
    PdfRectangle rect = new PdfRectangle(0, pdfReader.GetPageSizeWithRotation(1).Height / 2, pdfReader.GetPageSizeWithRotation(1).Width, pdfReader.GetPageSizeWithRotation(1).Height);
    // Bottom half instead:
    //PdfRectangle rect = new PdfRectangle(0, 0, pdfReader.GetPageSizeWithRotation(1).Width, pdfReader.GetPageSizeWithRotation(1).Height / 2);
    //FileStream output = new FileStream("C:/Docman/Matched/top.pdf", FileMode.Create);
    FileStream output = new FileStream("C:/Docman/Matched/bottom.pdf", FileMode.Create);
    Document doc = new Document(PageSize.A4);
    // Make a copy of the document. Note that changing CROPBOX/MEDIABOX only
    // changes the visible region; the full content stream is still copied,
    // which is why both halves come out the same size.
    PdfSmartCopy smartCopy = new PdfSmartCopy(doc, output);
    doc.Open();
    var page = pdfReader.GetPageN(1);
    page.Put(PdfName.CROPBOX, rect);
    page.Put(PdfName.MEDIABOX, rect);
    var copiedPage = smartCopy.GetImportedPage(pdfReader, 1);
    smartCopy.AddPage(copiedPage);
    doc.Close();
}
Off the top of my head, there are a few ways you could go about determining the document's orientation, each with its own pros and cons in efficiency, accuracy, and effort/cost.
Use an OCR package such as Tesseract or Cuneiform and scan the page in one orientation and then again rotated 180°. Since OCR packages will only detect correctly oriented text, whichever orientation captured more text is the correct one (see the sketch after this list). This method may not be the most efficient, but it would probably be the most accurate. There are many other OCR packages; consult Wikipedia.
Expose the contents of the JPEG in the PDF document via the iTextSharp.text.Image.RawData property, cast it to monochrome, and then use various scoring functions to assess areas of greater ink density. You will need to experiment here, but the first thing that comes to mind is to detect the heading/logo in your invoice, since that will most likely be at the top and have a greater density than the bottom. Another idea: maybe there is always a footer, bar code, or tracking number, and you could scan that portion of the page in either orientation. Its presence could be used as a flag.
You could use a pixel-difference technique: build a composite mask (image) from all documents you know to have the correct orientation, use that mask to perform a bitwise XOR with your unknown image, and again with the opposite orientation, and compare the sum of black pixels in each. The theory is that the unknown image lies in the domain of known images, so if it is oriented correctly it should show very few differences, but if oriented incorrectly it will show many.
If you have a known domain of invoices, you could detect a feature of each invoice which indicates its orientation, similar to how a vending machine detects the type of bill you insert.
Mechanical Turk :)
Some combination of the above.
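For option 1, a rough sketch using the Tesseract .NET wrapper; RenderPageToBitmap is a hypothetical helper standing in for however you rasterize the scanned page, and the mean OCR confidence serves as the "amount of text captured" score:

using System.Drawing;
using Tesseract;

static float OcrScore(TesseractEngine engine, Bitmap bmp)
{
    using (Pix pix = PixConverter.ToPix(bmp))   // PixConverter ships with the wrapper
    using (Page page = engine.Process(pix))
        return page.GetMeanConfidence();        // 0..1; higher means more readable text
}

// ...
using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
{
    Bitmap bmp = RenderPageToBitmap("invoice.pdf", 1);  // hypothetical helper
    float asScanned = OcrScore(engine, bmp);
    bmp.RotateFlip(RotateFlipType.Rotate180FlipNone);   // try it upside down
    float rotated = OcrScore(engine, bmp);
    bool upsideDown = rotated > asScanned;              // if so, rotate the PDF page
}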
Good Luck, let us know how you proceed!
I have written an extraction tool using iTextSharp that extracts annotation information from PDF documents. For a highlight annotation, I only get a rectangle for the area on the page that is highlighted.
I am aiming to extract the text that has been highlighted. For that I use PdfTextExtractor.
// The annotation's rectangle (presumably its /Rect): llx, lly, urx, ury.
Rectangle rect = new Rectangle(
    pdfArray.GetAsNumber(0).FloatValue,
    pdfArray.GetAsNumber(1).FloatValue,
    pdfArray.GetAsNumber(2).FloatValue,
    pdfArray.GetAsNumber(3).FloatValue);
// Extract only the text whose baseline intersects that rectangle.
RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
string textInsideRect = PdfTextExtractor.GetTextFromPage(pdfReader, pageNo, strategy);
return textInsideRect;
The result returned by PdfTextExtractor is not entirely correct. For instance, it returns "was going to eliminate the paper chase" even though only "eliminate" was highlighted.
Interestingly enough, the entire text of the TJ instruction containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).
I would love to hear any input regarding this issue; solutions that don't involve iTextSharp are also welcome.
The cause
Interestingly enough, the entire text of the TJ instruction containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).
This actually is the reason for your issue. The iText parser classes forward text to the render listeners in the pieces they find as continuous strings in the content stream, and the filter mechanism you use filters those whole pieces. Thus, the whole sentence is accepted by the filter.
What you need, therefore, is a pre-processing step that splits these pieces into their individual characters and forwards them individually to your filtered render listener.
This actually is fairly easy to implement. The argument type in which the text pieces are forwarded, TextRenderInfo, offers a method to split itself up:
/**
 * Provides detail useful if a listener needs access to the position of each individual glyph in the text render operation
 * @return A list of {@link TextRenderInfo} objects that represent each glyph used in the draw operation. The net effect is as if there was a separate Tj operation for each character in the rendered string
 * @since 5.3.3
 */
public List<TextRenderInfo> getCharacterRenderInfos() // iText / Java
virtual public List<TextRenderInfo> GetCharacterRenderInfos() // iTextSharp / .NET
Thus, all you have to do is create and use a RenderListener / IRenderListener implementation which forwards all the calls it gets to another listener (in your case, your filtered listener), with the twist that renderText / RenderText splits its TextRenderInfo argument and forwards the splinters individually.
A Java sample
As the OP asked for more details, here is some more code. As I'm predominantly working with Java, though, I'm providing it in Java for iText. It is easy to port to C# for iTextSharp (a sketch of the port follows below).
As mentioned above a pre-processing step is needed which splits the text pieces into their individual characters and forwards them individually to your filtered render listener.
For this step you can use this class TextRenderInfoSplitter:
package stackoverflow.itext.extraction;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class TextRenderInfoSplitter implements TextExtractionStrategy
{
    public TextRenderInfoSplitter(TextExtractionStrategy strategy)
    {
        this.strategy = strategy;
    }

    public void renderText(TextRenderInfo renderInfo)
    {
        // Forward each glyph individually instead of the whole chunk.
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
        {
            strategy.renderText(info);
        }
    }

    public void beginTextBlock()
    {
        strategy.beginTextBlock();
    }

    public void endTextBlock()
    {
        strategy.endTextBlock();
    }

    public void renderImage(ImageRenderInfo renderInfo)
    {
        strategy.renderImage(renderInfo);
    }

    public String getResultantText()
    {
        return strategy.getResultantText();
    }

    final TextExtractionStrategy strategy;
}
If you have a TextExtractionStrategy strategy (like your new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter)), you now can feed it with single-character TextRenderInfo instances like this:
String textInsideRect = PdfTextExtractor.getTextFromPage(reader, pageNo, new TextRenderInfoSplitter(strategy));
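Porting the splitter to iTextSharp is just as mechanical; a C# sketch (assuming iTextSharp 5.3.3+ for GetCharacterRenderInfos):

using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

public class TextRenderInfoSplitter : ITextExtractionStrategy
{
    private readonly ITextExtractionStrategy strategy;

    public TextRenderInfoSplitter(ITextExtractionStrategy strategy)
    {
        this.strategy = strategy;
    }

    public void RenderText(TextRenderInfo renderInfo)
    {
        // Forward each glyph individually so region filters see per-character baselines.
        foreach (TextRenderInfo info in renderInfo.GetCharacterRenderInfos())
            strategy.RenderText(info);
    }

    public void BeginTextBlock() { strategy.BeginTextBlock(); }
    public void EndTextBlock() { strategy.EndTextBlock(); }
    public void RenderImage(ImageRenderInfo renderInfo) { strategy.RenderImage(renderInfo); }
    public string GetResultantText() { return strategy.GetResultantText(); }
}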
I tested it with the PDF created in this answer for the area
Rectangle rect = new Rectangle(200, 600, 200, 135);
For reference I marked the area in the PDF:
Text extraction filtered by area without the TextRenderInfoSplitter results in:
I am trying to create a PDF file with a lot
of text contents in the document. I am
using PDFBox
Text extraction filtered by area with the TextRenderInfoSplitter results in:
to create a PDF f
ntents in the docu
n g P D F
BTW, here you see a disadvantage of splitting the text into individual characters early: the final text line is typeset using very large character spacing. If you keep the text segments from the PDF as they are, text extraction strategies can still easily see that the line consists of the two words using and PDFBox. As soon as you feed the text segments character by character into the text extraction strategies, they are likely to interpret such widely set words as many one-letter words.
An improvement
The highlighted word "eliminate", for instance, is extracted as "o eliminate t" (the word having been highlighted by double-clicking it in Adobe Acrobat Reader).
Something similar happens in my sample above: letters barely touching the area of interest make it into the result.
This is due to the RegionTextRenderFilter implementation of allowText accepting all text whose baseline intersects the rectangle in question, even if the intersection consists of merely a single point:
public boolean allowText(TextRenderInfo renderInfo) {
    LineSegment segment = renderInfo.getBaseline();
    Vector startPoint = segment.getStartPoint();
    Vector endPoint = segment.getEndPoint();
    float x1 = startPoint.get(Vector.I1);
    float y1 = startPoint.get(Vector.I2);
    float x2 = endPoint.get(Vector.I1);
    float y2 = endPoint.get(Vector.I2);
    return filterRect.intersectsLine(x1, y1, x2, y2);
}
Given that you first split the text into characters, you might want to check whether their respective baselines are completely contained in the area in question, i.e. implement your own RenderFilter by copying RegionTextRenderFilter and replacing the line
return filterRect.intersectsLine(x1, y1, x2, y2);
with
return filterRect.contains(x1, y1) && filterRect.contains(x2, y2);
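In C#, such a filter might look like the following sketch (subclassing iTextSharp's RenderFilter directly; the containment test is written out by hand rather than relying on a particular RectangleJ helper):

using iTextSharp.text.pdf.parser;
using System.util;

public class StrictRegionTextRenderFilter : RenderFilter
{
    private readonly RectangleJ rect;

    public StrictRegionTextRenderFilter(RectangleJ rect)
    {
        this.rect = rect;
    }

    public override bool AllowText(TextRenderInfo renderInfo)
    {
        LineSegment baseline = renderInfo.GetBaseline();
        Vector start = baseline.GetStartPoint();
        Vector end = baseline.GetEndPoint();
        // Accept only baselines lying entirely inside the region.
        return Contains(start[Vector.I1], start[Vector.I2])
            && Contains(end[Vector.I1], end[Vector.I2]);
    }

    private bool Contains(float x, float y)
    {
        return x >= rect.X && x <= rect.X + rect.Width
            && y >= rect.Y && y <= rect.Y + rect.Height;
    }
}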
Depending on how exactly text is highlighted in Adobe Acrobat Reader, though, you might want to tune this in a completely custom way.
Highlight annotations are represented as a collection of quadrilaterals covering the area(s) on the page surrounded by the annotation, stored in the /QuadPoints entry of the annotation dictionary.
Why are they this way?
This is my fault, actually. In Acrobat 1.0, I worked on the "find text" code, which initially used only a rectangle for the representation of a selected area on the page. While working on the code, I was very unhappy with the results, especially with maps where the text followed land details.
As a result, I made the find tool build up a set of quadrilaterals on the page and anneal them, when possible, to build words.
In Acrobat 2.0, the engineer responsible for full generalized text extraction built an algorithm called Wordy that was better than my first cut, but he kept the quadrilateral code since that was the most accurate representation of what was on the page.
Almost all text-related code was refactored to use this code.
Then came highlight annotations. When markup annotations were added to Acrobat, they were used to decorate text that was already on the page. When a user clicks down on a page, Wordy extracts the text into appropriate data structures, and the text select tool maps mouse motion onto the quadrilateral sets. When a text highlight annotation is created, the subset of quadrilaterals from Wordy is placed into a new text highlight annotation.
How do you get the words on the page that are highlighted? Tricky. You have to extract the text on the page (you don't have Wordy, sorry) and then find all the quads that are contained within the set from the annotation.
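In iTextSharp terms, reading those quads out of a highlight annotation might look like this sketch (the corner order shown, upper-left, upper-right, lower-left, lower-right, is what Acrobat writes in practice):

using iTextSharp.text.pdf;

PdfDictionary pageDict = reader.GetPageN(pageNo);
PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);
if (annots != null)
{
    for (int i = 0; i < annots.Size; i++)
    {
        PdfDictionary annot = annots.GetAsDict(i);
        if (annot == null || !PdfName.HIGHLIGHT.Equals(annot.GetAsName(PdfName.SUBTYPE)))
            continue;
        PdfArray quads = annot.GetAsArray(PdfName.QUADPOINTS);
        for (int q = 0; q + 7 < quads.Size; q += 8)   // 8 numbers per quadrilateral
        {
            float llx = quads.GetAsNumber(q + 4).FloatValue;  // lower-left x
            float lly = quads.GetAsNumber(q + 5).FloatValue;  // lower-left y
            float urx = quads.GetAsNumber(q + 2).FloatValue;  // upper-right x
            float ury = quads.GetAsNumber(q + 3).FloatValue;  // upper-right y
            // ...test each extracted glyph's baseline against this quad...
        }
    }
}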