Parsing Complex PDF document with C#

See the attached K-1 document. I have tried numerous tweaks with the iTextSharp library but haven't succeeded in loading the data correctly.
Ideally, I would like to parse the document the way a human would read it: one text box at a time, reading its contents.
var reader = new PdfReader(FILE, Encoding.ASCII.GetBytes(password));
string[] lines;
var strategy = new LocationTextExtractionStrategy();
string currentPageText = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
lines = currentPageText.Split(new string[] {"\r\n", "\n"}, StringSplitOptions.None);
I also tried annotation parsing but had no luck.
I'm a newbie and am probably looking in the wrong place. Can you help guide me in the right direction?
Thanks a lot.

You would like to parse out the document similar to how humans would read them, one textbox at a time, reading its contents. That means you first will have to try and automatically recognize those text boxes. Then you can extract text by these areas.
To recognize those text boxes automatically in your document, you have to extract the border lines enclosing the boxes. For this you will first have to find out how those border lines are created. They might be drawn using vector graphics as lines or rectangles, but they could also be part of a background bitmap image.
Unfortunately I don't have your IRS form at hand and so cannot analyze its internals. Let's assume the borders are created using vector graphics for now. Thus, you have to extract vector graphics.
To extract vector graphics with iText(Sharp), you make use of classes from the iText(Sharp) parser namespace by making them parse the document and feed the parsing events into a listener you create which collects the vector graphic operations:
1. You implement IExtRenderListener, in particular its ModifyPath and RenderPath methods. ModifyPath is called when additional path elements (e.g. lines or rectangles) are added to the current path; RenderPath is called when the current path is rendered (stroked and/or filled). Your implementation collects this information.
2. You parse your document into an instance of your listener, e.g. using PdfReaderContentParser.
3. You analyse the lines and rectangles found and derive the coordinates of the boxes they form.
4. You parse the same page into a LocationTextExtractionStrategy instance.
5. You retrieve the texts of the recognized text boxes by calling LocationTextExtractionStrategy.GetResultantText with a matching ITextChunkFilter argument for each box.
(Actually you can do the parsing into the instance of your listener and the LocationTextExtractionStrategy instance in one pass for a bit of optimization.)
All iText(Sharp) specific tasks are trivial, and the only other task, the analysis of the lines and rectangles found to derive the coordinates of the boxes, should be no big problem for a software developer proficient in C#.
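The last step, picking the text that belongs to each recognized box, can be sketched independently of any PDF library. The sketch below is only an illustration under stated assumptions: the Box and Chunk types and the BoxTextCollector class are hypothetical names (they are not part of iText(Sharp)), each chunk is represented by a single point, and chunks on the same line are assumed to share exactly the same Y value. It uses records, so it needs C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration; they are not part of iText(Sharp).
public record Box(string Name, float Left, float Bottom, float Right, float Top)
{
    // PDF coordinates: y grows upward, so Bottom < Top.
    public bool Contains(float x, float y) =>
        x >= Left && x <= Right && y >= Bottom && y <= Top;
}

// A piece of text and the point where it starts on the page.
public record Chunk(string Text, float X, float Y);

public static class BoxTextCollector
{
    // Returns the text found inside each box, read top to bottom, then left to right.
    public static Dictionary<string, string> Collect(
        IEnumerable<Box> boxes, IEnumerable<Chunk> chunks)
    {
        var chunkList = chunks.ToList();
        var result = new Dictionary<string, string>();
        foreach (var box in boxes)
        {
            var inside = chunkList
                .Where(c => box.Contains(c.X, c.Y))
                .OrderByDescending(c => c.Y)   // higher on the page first
                .ThenBy(c => c.X);             // then left to right
            result[box.Name] = string.Join(" ", inside.Select(c => c.Text));
        }
        return result;
    }
}
```

With iText(Sharp), the same containment test would go into the ITextChunkFilter passed to GetResultantText; here it is kept as plain C# so the logic can be tried without a PDF at hand.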

The first question is whether this form is electronic or scanned. A scanned form makes data extraction much harder, because it also requires OCR.
If you have an electronic PDF and all your forms share the same layout, why not use the following strategy:
1. Store the coordinates of each "box" in a config file.
2. Process the documents and extract the text from every "box" (i.e. region).
3. Post-process the extracted text with regular expressions to separate the name from the address (or set the region so that the text is read line by line).
If you have a few variations of the form, you can check the very first box to extract the name of the form and then load the appropriate settings file (the one that contains the set of regions for that variation).
This approach should work with any PDF library.
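As a rough illustration of the config-file idea, the sketch below parses region definitions and splits the text of a name-and-address box. Everything here is an assumption made for the example: the line format "Name=left,bottom,right,top", the Region type, the RegionConfig class, and the rule that the first line of the box is the name. It uses records, so it needs C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical region type; coordinates are in PDF user space units.
public record Region(string Name, float Left, float Bottom, float Right, float Top);

public static class RegionConfig
{
    // Parses lines in the assumed format "Name=left,bottom,right,top".
    public static List<Region> Parse(IEnumerable<string> configLines)
    {
        var regions = new List<Region>();
        foreach (var raw in configLines)
        {
            var line = raw.Trim();
            if (line.Length == 0 || line[0] == '#') continue; // skip blank lines and comments

            var parts = line.Split(new[] { '=' }, 2);
            if (parts.Length != 2) throw new FormatException("Bad region line: " + line);

            var numbers = parts[1].Split(',');
            if (numbers.Length != 4) throw new FormatException("Expected 4 coordinates: " + line);

            // Invariant culture so "36.5" parses the same on every machine.
            float[] v = Array.ConvertAll(numbers,
                n => float.Parse(n.Trim(), CultureInfo.InvariantCulture));
            regions.Add(new Region(parts[0].Trim(), v[0], v[1], v[2], v[3]));
        }
        return regions;
    }

    // Assumes the first non-empty line is the name and the rest is the address.
    public static (string Name, string Address) SplitNameAndAddress(string boxText)
    {
        var lines = boxText.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0) return ("", "");
        return (lines[0].Trim(), string.Join(", ", lines.Skip(1).Select(l => l.Trim())));
    }
}
```

Each parsed Region can then be handed to whatever region-based text extraction the chosen PDF library offers.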

Take a look at the IvyPdf library and its template editor. It uses C# and provides high-level functions to parse and extract data, so you don't have to deal with the internals of PDF documents. You can build fairly complex scenarios with it.
I don't think it can read annotations though.

Related

iText7 Dynamically Set Field Calculation At Run-time

I am working with the iText7 library, version 7.0.2.2, in a C# web application. A PDF document is produced with a variable number of dynamically created pages, depending on the amount of data.
Is there a way to set a field with a calculated formula at run time? So for example, something along the lines of having a subtotal field calculation like
the product of Page1.Lineitem1.qty and Page1.LineItem1.unitprice.

iText merely creates the view; the data model feeding it is provided by you.
Furthermore, iText only draws text strings, not numerical types, so it would have to parse those strings back into numbers, which can be difficult considering all the ways numbers can be formatted with commas, periods, plus and minus signs, brackets, units, ...
And the text pieces you draw with iText are not named.
And iText flushes content to the output as soon as possible to save memory.
...
So no, iText does not provide support for "the product of Page1.Lineitem1.qty and Page1.LineItem1.unitprice" or similar expressions.

iText PDF Parser does not parse the data as a whole word with octet-stream

I'm trying to parse a PDF file using iTextSharp (version 5.5.1.0). The PDF file has the content type "application/octet-stream". I'm using C# code to read the text, based on the location strategy:
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new Rectangle(
    bottomLeft[Vector.I1],
    bottomLeft[Vector.I2],
    topRight[Vector.I1],
    topRight[Vector.I2]);
var word = renderInfo.GetText().Trim();
// get column no
var position = (int)rect.Left;
Issue: When I read the text with RenderInfo.GetText(), I get incomplete words: instead of "Daily" I get "Dai", and then "ly" in the next iteration. Is there any way I can read the text word by word?
Please let me know if you need more info, unfortunately there is no option to attach the pdf file here.
Regards
Pradeep Jain

"When I read it RenderInfo.GetText() I get incomplete words like instead of "Daily" I get "Dai" and "ly" in next loop."
That behavior is expected.
In a render listener / text extraction strategy you get the individual atomic string parameters of text drawing instructions. There is no requirement for PDF creation software to put whole words into these strings.
Actually, the PDF format even encourages this splitting of words! It does not by itself use the kerning information of fonts; thus, any software that wants to create text output with kerning has to split strings wherever kerning comes into play and slightly move the text insertion point between the string parts in the text drawing instructions.
Thus, a render listener has to collect the strings and glue them together before it can expect to get whole words.
"Is there any way I can read complete word by word?"
Yes, by collecting the strings and gluing them together.
You mentioned that you read based on the location strategy, so look closer at what LocationTextExtractionStrategy itself does: in its RenderText implementation it collects the text pieces together with some coordinates, and only after collecting all those pieces does it sort them and glue them together in its GetResultantText method. (The code is part of the iTextSharp source.)
Unfortunately, many members of that strategy are not directly accessible in derived classes, so you may have to resort to reflection or simply copy the whole class code and change it in place.
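To make the "collect, sort, glue" idea concrete, here is a minimal sketch that does not depend on iTextSharp. It is not the code of LocationTextExtractionStrategy. The TextPiece type, the WordGluer class, and both tolerance values are assumptions for illustration; a render listener would have to fill in the start and end X coordinates and the baseline of each piece. The sketch also assumes that pieces on the same line share the same baseline value, and it needs C# 9 or later for the record.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-in for what a render listener could record per text piece.
public record TextPiece(string Text, float StartX, float EndX, float BaselineY);

public static class WordGluer
{
    // Sorts the pieces into reading order and joins neighbours that touch;
    // a wide horizontal gap or a change of line starts a new word.
    public static List<string> GlueIntoWords(
        IEnumerable<TextPiece> pieces, float gapTolerance = 1.0f, float lineTolerance = 2.0f)
    {
        var text = new StringBuilder();
        bool first = true;
        float previousEndX = 0f, previousBaseline = 0f;

        foreach (var piece in pieces.OrderByDescending(p => p.BaselineY).ThenBy(p => p.StartX))
        {
            if (!first)
            {
                bool newLine = Math.Abs(piece.BaselineY - previousBaseline) > lineTolerance;
                bool wideGap = piece.StartX - previousEndX > gapTolerance;
                if (newLine || wideGap) text.Append(' ');
            }
            text.Append(piece.Text);
            previousEndX = piece.EndX;
            previousBaseline = piece.BaselineY;
            first = false;
        }

        // Pieces may contain spaces themselves, so split on whitespace at the end.
        return text.ToString()
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .ToList();
    }
}
```

Suitable tolerances depend on the font size and on how the PDF was produced, so the defaults above are only placeholders.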

Looping Through Large KML Text Files

I am a novice at programming, working on a C# solution for a geomorphology project. I need to extract coordinates from a variable number of Google Earth KML ground overlay files, converted to one long text string, and enter them into an array that can be accessed by other methods.
The KML tags and data of interest look like this:
<LatLonBox>
<north>37.91904192681665</north>
<south>37.46543388598137</south>
<east>15.35832653742206</east>
<west>14.60128369746704</west>
<rotation>-0.1556640799496235</rotation>
</LatLonBox>
The text files I will be processing with the program could contain anywhere from 1 to 100 or more of these data groups, each embedded within the standard KML file headers/footers and other tags that are extraneous to my work. I have already developed the method for extracting the coordinate values as strings and have tested it on one KML file.
At this point it seems that the most efficient approach would be to construct some kind of looping method to search through the string for a coordinate data group, extract the data to a row in the array, then continue to the next group. The method might also go through the string and extract all the "north" data to a column in the array first, then loop back for all the "south" data, etc. I am open to any suggestions.
Due to my limited programming background, straightforward solutions would be preferred over elegant or advanced ones, but give it your best shot.
Thanks for your help.
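One straightforward way to loop over every coordinate group is to let an XML parser find the LatLonBox elements instead of searching the string by hand. The sketch below is only one possible approach, not something taken from the question: the KmlBoxReader class and its method names are made up for illustration, it assumes each KML file is parsed separately as a well-formed XML document, and it treats a missing value as 0.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class KmlBoxReader
{
    // Column order of each returned row.
    private static readonly string[] Fields = { "north", "south", "east", "west", "rotation" };

    // Returns one row per <LatLonBox>: [north, south, east, west, rotation].
    public static double[][] ReadLatLonBoxes(string kmlText)
    {
        XDocument doc = XDocument.Parse(kmlText);
        return doc.Descendants()
            // Compare local names so the KML namespace does not get in the way.
            .Where(e => e.Name.LocalName == "LatLonBox")
            .Select(box => Fields.Select(f => ReadValue(box, f)).ToArray())
            .ToArray();
    }

    private static double ReadValue(XElement box, string field)
    {
        XElement child = box.Elements().FirstOrDefault(e => e.Name.LocalName == field);
        // Assumption of this sketch: a missing element counts as 0.
        return child == null
            ? 0.0
            : double.Parse(child.Value.Trim(), CultureInfo.InvariantCulture);
    }
}
```

Calling KmlBoxReader.ReadLatLonBoxes(text) on the contents of one file gives an array with one row per data group, so rows[0][0] is the "north" value of the first group. Rows from several files can be appended to a single List<double[]> if one combined table is needed.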

Get text position in Microsoft Word from VBA or C# Interop

I want to access the position and size for each indivisible unit in Microsoft Word. Examples of such units include individual characters, images, etc.
The purpose is to apply a visual overlay based on unit position and size. I will have no knowledge of the content in target documents.
Imagine the text of this question in a word document. I need to be able to iterate each character including white-space and carriage returns and get the size and position.
EDIT
It doesn't matter whether your answer considers macros, interop, add-ins or OLE embedding.

The method that retrieves the displayed coordinates of an object is Window.GetPoint (it is available in the Office interop API and works the same way in VBA).
As for the "indivisible unit," you can put any meaning you want into that, using the available collections.
For instance, if you want it to be characters, you can use Document.Range.Characters, which is a collection of characters, each of which is a Range.
Or you can use Document.Range.InlineShapes for the pictures that are part of text.
Or Document.Range.ShapeRange to enumerate "floating" shapes.
At that point you might also consider Window.RangeFromPoint to find the object at given window coordinates.

How do I extract sections (multiple sections per page, multiple pages) of a word document/pdf/image as separate images/word documents/pdfs?

Here's the basic problem: I have about 10,000 word documents that contain blocks of data. Each block is numbered and also has an accompanying image. I need to somehow store these individual blocks to a db as images (text would be great, but read note below), without the numbering.
I can go through and have typists mark the beginning and ends of the blocks using a ###QUESTIONSTART###, ###QUESTIONEND### or whatever. I am trying to take that document, convert it to a big image, look for those tags, extract the part in between the tags as an image and then move on to the next block.
I've been looking at some APIs and I think I can definitely crop the images once I figure out how to get the coordinates of each start/end marker. Any suggestions? I'd hate to write a pixel by pixel matcher that has to go O(no of blocks * n^2)
NOTE: These blocks contain complex equations/math type stuff hence the images. I don't have the $$ to get 1000 typists trained in TeX and retype the whole deal. OCR doesn't cut it yet.

I don't fully understand your question, but my impression is that Tika can help you.

If you can have typists add block marks to 10,000 documents, why can't the typists do the following instead?
1. Open the Word document
2. Copy the image from the Word document
3. Paste the image into Paint
4. Save the image to their disk
You can come up with an image naming scheme that makes sense to you and your typists.
Then you can collect the images from the disk drives with a program and load them into your database.