I have a PDF file with tabular structure but I am not able to store it in database as the PDF file is in Mangal font.
So two problems occur to me:
Extract table data from PDF
Text is in Marathi language
I have managed to do this for English with the following code:
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, i+1, strategy);
text.Append(currentText);
string rawPdfContent = Encoding.UTF8.GetString(Encoding.Convert(Encoding.UTF8, Encoding.UTF8, pdfReader.GetPageContent(i + 1)));
This encoding gives tabular structure but only for English font, want to know for Marathi.
Funnily enough, requirement no. 1 is actually the hardest.
In order to understand why, you need to understand PDF a bit.
PDF is not a WYSIWYG format. If you open a PDF file in notepad (or notepad++), you'll see that it doesn't seem to contain any human-readable information.
In fact, PDF contains instructions that tell a viewer program (like Adobe) how to render the PDF.
So instead of having an actual table in there (like you might expect in an HTML document), it will contain stuff like:
draw a line from .. to ..
go to position ..
draw the characters '123'
set the font to Helvetica bold
go to position ..
draw a line from .. to ..
draw the characters '456'
etc
See also How does TextRenderInfo work in iTextSharp?
In order to extract the table from the PDF, you need to do several things.
implement IEventListener (this is a class that you can attach to a Parser instance, a Parser will go over the entire page, and notify all listeners of things like TextRenderInfo, ImageRenderInfo and PathRenderInfo events)
watch out for PathRenderInfo events
build a datastructure that tracks which paths are being drawn
as soon as you detect a cluster of lines that is at roughly 90° angles, you can assume a table is being drawn
determine the biggest bounding box that fits the cluster of lines (this is know as the convex hull problem, and the algorithm to solve it is called the gift wrapping algorithm)
now you have a rectangle that tells you where (on the page) the table is located.
you can now recursively apply the same logic within the table to determine rows and columns
you can also keep track of TextRenderInfo events, and sort them into bins depending on the rectangles that fit each individual cell of the table
This is a lot of work. None of this is trivial. In fact this is the kind of stuff people write phd theses about.
iText has a good implementation of most of these algorithms in the form of the pdf2Data tool.
Code:
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, i+1, strategy);
string rawPdfContent = Encoding.UTF8.GetString(Encoding.Convert(Encoding.UTF8, Encoding.UTF8, pdfReader.GetPageContent(i + 1)));
Then I have identified lines (Horizontal and Vertical) from PDF. As for lines PDF has either re or m and l Keywords.
Then I worked for marathi text which I got from iTextSharp.
Then I merged both for desired location I extract the text using code-
Int64 width = Convert.ToInt64(linesVertical[5].StartPoint.X) - Convert.ToInt64(linesVertical[2].StartPoint.X);
Int64 height = Convert.ToInt64(linesVertical[2].EndPoint.Y) - (Convert.ToInt64(linesVertical[2].StartPoint.Y));
System.util.RectangleJ rect = new System.util.RectangleJ(Convert.ToInt64(linesVertical[2].StartPoint.X), (800 - Convert.ToInt64(linesVertical[2].EndPoint.Y) + 150), width, height);
RenderFilter[] renderFilter = new RenderFilter[1];
renderFilter[0] = new RegionTextRenderFilter(rect);
ITextExtractionStrategy textExtractionStrategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
Owner_Name = PdfTextExtractor.GetTextFromPage(reader, 1, textExtractionStrategy);
Related
I have a PDF in which data is displayed in a table. In this table, I have multiple columns, but I want to get particular column values as a list. Is this possible?
This is my code:
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
return text.ToString();
With this code, I get all of the text of the PDF, but I want a particular column of data. The column name is "Date".
This is much more complicated than you would think.
A PDF document does not (always) contain structure information. It only has instructions that a viewer needs to render the document.
Imagine something like:
go to 50, 50
use font Helvetica Bold
draw the glyph for character 'H'
go to 56, 50
draw the glyph for character 'e'
These instructions do not even need to appear in logical reading order.
As a result, determining what makes up a logical table, based on instructions is very hard.
Possible approach (if your table contains enough lines):
use IEventListener to be notified of PathRenderInfo and TextRenderInfo
gather PathRenderInfo into lines
gather lines into clusters if (and only if) they cross at 90° angles
determine number of rows and columns from such line clusters
assume something is a table if (and only if) it consists of enough rows and columns and has some text in it
I'm trying to parse a pdf file using itextsharp (version: 5.5.1.0). The pdf file has content-type as "application/octet-stream". I'm using C# code to read based on Location Strategy
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]);
var word = renderInfo.GetText().Trim();
// get column no
var position = (int)rect.Left;
Pdf file image
Issue: When I read it RenderInfo.GetText() I get incomplete words like instead of "Daily" I get "Dai" and "ly" in next loop. Is there any way I canread complete word by word ?
Please let me know if you need more info, unfortunately there is no option to attach the pdf file here.
Regards
Pradeep Jain
When I read it RenderInfo.GetText() I get incomplete words like instead of "Daily" I get "Dai" and "ly" in next loop.
That behavior is expected.
In a render listener / text extraction strategy you get the individual atomic string parameters of text drawing instructions. There is no requirement for PDF creation software to put whole words into these strings.
Actually the PDF format even encourages this splitting of words! It does not by itself use the kerning information of fonts; thus, any software that wants to create text output with kerning has to split strings wherever kerning comes into play and sligthly move the text insertion point between the string parts in text drawing instructions.
Thus, a render listener has to collect the strings and glue them together before it can expect to get whole words.
Is there any way I canread complete word by word ?
Yes, by collecting the strings and gluing them together.
You mentioned you read based on Location Strategy - then look closer at what the LocationTextExtractionStrategy itself does: In its RenderText implementation it collects the text pieces with some coordinates, and only after collecting all those pieces, it sorts them and glues them together in its GetResultantText method. (You can find the code here.)
Unfortunately many members of that strategy are not immediately available in derived classes, so you may have to resort to reflection or simply to copying the whole class code and change it in situ.
I am using iTextSharp in a c# Windows App to manipulate scanned portrait PDF invoice files. After scanning the files I'd like to automatically check (estimate) the orientation of the text on the page (user may have scanned upside down).
Invoices come from a variety of suppliers so I can't search for standard text or an image.
I was thinking that if I can could crop the PDF page in two (top and bottom), and create two new PDF files I could then compare the two file sizes. The largest file would probably be the top of the page. I could then rotate (I know how to do this bit) the page if required.
Thanks
Update - I have found a way to split the page in half but unfortunately the 2 files that are created are the same size (even though there are more text and images in the top half) :
private void TrimDocument()
{
//derived from http://www.namedquery.com/cropping-pdf-using-itextsharp
PdfReader pdfReader = new PdfReader("C:/Docman/RawScans/PDFWeightedTop.pdf");
PdfRectangle rect = new PdfRectangle(0, pdfReader.GetPageSizeWithRotation(1).Height / 2, pdfReader.GetPageSizeWithRotation(1).Width, pdfReader.GetPageSizeWithRotation(1).Height); //Top
//***PdfRectangle rect = new PdfRectangle(0, 0, pdfReader.GetPageSizeWithRotation(1).Width, pdfReader.GetPageSizeWithRotation(1).Height/2); //Bottom
//***FileStream output = new FileStream("C:/Docman/Matched/top.pdf", FileMode.Create);
FileStream output = new FileStream("C:/Docman/Matched/bottom.pdf", FileMode.Create);
Document doc = new Document(PageSize.A4);
//Make a copy of the document
PdfSmartCopy smartCopy = new PdfSmartCopy(doc, output);
doc.Open();
var page = pdfReader.GetPageN(1);
page.Put(PdfName.CROPBOX, rect);
page.Put(PdfName.MEDIABOX, rect);
var copiedPage = smartCopy.GetImportedPage(pdfReader, 1);
smartCopy.AddPage(copiedPage);
doc.Close();
}
Off the top of my head there are a few ways you could go about determining the documents orientation, each with their own pros/cons of efficiency, accuracy, and effort/cost.
Use an OCR package such as Tesseract or Cuneiform and scan the page in one orientation and then again rotated 180. Since OCR packages will only detect correctly oriented text, whichever orientation captured more text is the correct orientation. This method may not be the most efficient but it would probably be the most accurate. There are many other OCR packages, consult Wikipedia.
Expose the contents of the jpeg in the PDF document via iTextSharp.text.Image.RawData property, cast it to monochrome and then use various scoring functions to assess areas of greater ink density. You will need to experiment here, but first thing that comes to mind is to detect the heading/logo in your invoice since that will most likely be at the top and will have a greater density than the bottom. Another idea is maybe there is always a footer, bar code, or tracking number and you could scan that portion of the page in either orientation. It's presence could be used as a flag.
You could use a pixel difference technique and build a composite mask (image) of all documents you know which have the correct orientation and use that mask to perform a bitwise XOR with your unknown image, and again with the opposite orientation, and compare the sum of black pixels in each. The theory being that the unknown image will be in the domain of known images and if it is oriented correctly should have very few differences, but if oriented incorrectly will have many differences.
If you have a known domain of invoices you could detect a feature of each invoice which indicates its orientation, similar to how a vending machine detects the type of bill you insert.
Mechanical Turk :)
Some combination of the above.
Good Luck, let us know how you proceed!
I have written an extraction tool using iTextSharp that extracts annotation information from PDF documents. For the highlight annotation, I only get a rectangle for the area on the page which is highlighted.
I am aiming for extracting the text that has been highlighted. For that I use `PdfTextExtractor'.
Rectangle rect = new Rectangle(
pdfArray.GetAsNumber(0).FloatValue,
pdfArray.GetAsNumber(1).FloatValue,
pdfArray.GetAsNumber(2).FloatValue,
pdfArray.GetAsNumber(3).FloatValue);
RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
string textInsideRect = PdfTextExtractor.GetTextFromPage(pdfReader, pageNo, strategy);
return textInsideRect;
The result returned by PdfTextExtractor is not entirely correct. For instance it returns "was going to eliminate the paper chase" even though only "eliminate" was highlighted.
Interesting enough the entire text for the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).
I would love to hear any input regarding this issue - also solutions that doesn't involve iTextSharp.
The cause
Interesting enough the entire text for the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).
This actually is the reason for your issue. The iText parser classes forward the text to the render listeners in the pieces they find as continuous strings in the content stream. The filter mechanism you use filters these pieces. Thus, that whole sentence is accepted by the filter.
What you need, therefore, is some pre-processing step which splits these pieces into their individual characters and forwards these individually to your filtered render listener.
This actually is fairly easy to implement. The argument type in which the text pieces are forwarded, TextRenderInfo, offers a method to split itself up:
/**
* Provides detail useful if a listener needs access to the position of each individual glyph in the text render operation
* #return A list of {#link TextRenderInfo} objects that represent each glyph used in the draw operation. The next effect is if there was a separate Tj opertion for each character in the rendered string
* #since 5.3.3
*/
public List<TextRenderInfo> getCharacterRenderInfos() // iText / Java
virtual public List<TextRenderInfo> GetCharacterRenderInfos() // iTextSharp / .Net
Thus, all you have to do is create and use a RenderListener / IRenderListener implementation which forwards all the calls it gets to another listener (your filtered listener in your case) with the twist that renderText / RenderText splits its TextRenderInfo argument and forwards the splinters one by one individually.
A Java sample
As the OP asked for more details, here some more code. As I'm predominantly working with Java, though, I'm providing it in Java for iText. But it is easy to port to C# for iTextSharp.
As mentioned above a pre-processing step is needed which splits the text pieces into their individual characters and forwards them individually to your filtered render listener.
For this step you can use this class TextRenderInfoSplitter:
package stackoverflow.itext.extraction;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
public class TextRenderInfoSplitter implements TextExtractionStrategy
{
public TextRenderInfoSplitter(TextExtractionStrategy strategy)
{
this.strategy = strategy;
}
public void renderText(TextRenderInfo renderInfo)
{
for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
{
strategy.renderText(info);
}
}
public void beginTextBlock()
{
strategy.beginTextBlock();
}
public void endTextBlock()
{
strategy.endTextBlock();
}
public void renderImage(ImageRenderInfo renderInfo)
{
strategy.renderImage(renderInfo);
}
public String getResultantText()
{
return strategy.getResultantText();
}
final TextExtractionStrategy strategy;
}
If you have a TextExtractionStrategy strategy (like your new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter)), you now can feed it with single-character TextRenderInfo instances like this:
String textInsideRect = PdfTextExtractor.getTextFromPage(reader, pageNo, new TextRenderInfoSplitter(strategy));
I tested it with the PDF created in this answer for the area
Rectangle rect = new Rectangle(200, 600, 200, 135);
For reference I marked the area in the PDF:
Text extraction filtered by area without the TextRenderInfoSplitter results in:
I am trying to create a PDF file with a lot
of text contents in the document. I am
using PDFBox
Text extraction filtered by area with the TextRenderInfoSplitter results in:
to create a PDF f
ntents in the docu
n g P D F
BTW, you here see a disadvantage of splitting the text into individual characters early: The final text line is typeset using very large character spacing. If you keep the text segments from the PDF as they are, text extraction strategies still easily can see that the line consists of the two words using and PDFBox. As soon as you feed the text segments character by character into the text extraction strategies, they are likely to interpret such widely set words as many one-letter words.
An improvement
The highlighted word "eliminate" is for instance extracted as "o eliminate t". This has been highlighted by double clicking the word and highlighted in Adobe Acrobat Reader.
Something similar happens in my sample above, letters barely touching the area of interest make it into the result.
This is due to the RegionTextRenderFilter implementation of allowText allowing all text to continue whose baseline intersects the rectangle in question, even if the intersection consists of merely a single dot:
public boolean allowText(TextRenderInfo renderInfo){
LineSegment segment = renderInfo.getBaseline();
Vector startPoint = segment.getStartPoint();
Vector endPoint = segment.getEndPoint();
float x1 = startPoint.get(Vector.I1);
float y1 = startPoint.get(Vector.I2);
float x2 = endPoint.get(Vector.I1);
float y2 = endPoint.get(Vector.I2);
return filterRect.intersectsLine(x1, y1, x2, y2);
}
Given that you first split the text into characters, you might want to check whether their respective base line is completely contained in the area in question, i.e. implement an own
RenderFilter by copying RegionTextRenderFilter and then replacing the line
return filterRect.intersectsLine(x1, y1, x2, y2);
by
return filterRect.contains(x1, y1) && filterRect.contains(x2, y2);
Depending on how exactly exactly text is highlighted in Adobe Acrobat Reader, though, you might want to change this in a completely custom way.
Highlight annotations are represented a collection of quadrilaterals that represent the area(s) on the page surrounded by the annotation in the /QuadPoints entry in the dictionary.
Why are they this way?
This is my fault, actually. In Acrobat 1.0, I worked on the "find text" code which initially only used a rectangle for the representation of a selected area on the page. While working on the code, I was very unhappy with the results, especially with maps where the text followed land details.
As a result, I made the find tool build up a set of quadrilaterals on the page and anneal them, when possible, to build words.
In Acrobat 2.0, the engineer responsible for full generalized text extraction built an algorithm called Wordy that was better than my first cut, but he kept the quadrilateral code since that was the most accurate representation of what was on the page.
Almost all text-related code was refactored to use this code.
Then we get highlight annotations. When markup annotations were added to Acrobat, they were used to decorate text that was already on the page. When a user clicks down on a page, Wordy extracts the text into appropriate data structures and then the text select tool maps mouse motion onto the quadrilateral sets. When a text highlight annotation is created, the subset of quadrilaterals from Wordy get placed into a new text highlight annotation.
How do you get the words on the page that are highlighted. Tricky. You have to extract the text on the page (you don't have Wordy, sorry) and then find all quads that are contained within the set from the annotation.
I'm working on a text extraction system from PDF files using iTextSharp. I have already created a class that implements ITextExtractionStrategy and implemented methods like RenderText(), GetResultantText() etc. I have studied LocationTextExtractionStrategy class provided by iTextSharp itself as well.
The problem I'm facing is that for a particular PDF document, the RenderText() method reports the horizontal position of a few text chunks incorrectly. This happens for around 15-20 chunks out of a total of 700+ text chunks available on the page. I'm using the following simple code to get text position in RenderText():
Vector curBaselineStart = renderInfo.GetBaseline().GetStartPoint();
LineSegment segment = renderInfo.GetBaseline();
TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
chunks.Add(location);
After collecting all the text chunks, I try to draw them on a bitmap, using Graphics class and the following simple loop:
for (int k = 0; k < chunks.Count; k++)
{
var ch = chunks[k];
g.DrawString(ch.text, fnt, Brushes.Black, ch.startLocation[Vector.I1], bmp.Height - ch.startLocation[Vector.I2], StringFormat.GenericTypographic);
}
The problem happens with the X (horizontal) dimension only for these few text chunks. They appear slightly towards the left than their actual position. Was wondering if there's something wrong with my code here.
Shujaat
Finally figured this out. In PDF, computing actual text positions is more complicated than simply getting the baseline co-ordinates. You need to incorporate character and word spacing, horizontal and vertical scaling and some other factors too. I did some correspondance with iText guys and they have now incorporated a new method in TextRenderInfo class that provides actual character-by-character positions by taking care of all of the above factors.