iText7 Dynamically Set Field Calculation At Run-time - c#

I am working with the iText7 library, version 7.0.2.2, in a C# web application. A PDF document is produced with a variable number of dynamically created pages, depending on the amount of data.
Is there a way to set a field with a calculated formula at run time? For example, a subtotal field calculated as
the product of Page1.Lineitem1.qty and Page1.LineItem1.unitprice.

iText merely creates the view; the data model feeding it is provided by you.
Furthermore, iText only draws text strings, not numeric types, so it would have to parse those strings back into numbers. That is difficult given all the ways numbers can be formatted: commas, periods, plus and minus signs, brackets, units, and so on.
The text pieces you draw with iText are not named, either.
And iText flushes content to the output as soon as possible to save memory.
...
So no, iText does not provide support for "the product of Page1.Lineitem1.qty and Page1.LineItem1.unitprice" or similar expressions.
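Since the data model is yours, one option is to compute such values in the model before anything is drawn. The following is only a minimal sketch of that idea; LineItem, InvoiceMath and their members are hypothetical names for illustration and are not part of iText.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical data model; the names are illustrative and not part of iText.
public sealed class LineItem
{
    public int Qty { get; set; }
    public decimal UnitPrice { get; set; }

    // Computed in the model, so nothing has to be parsed back from the PDF.
    public decimal Subtotal => Qty * UnitPrice;
}

public static class InvoiceMath
{
    // Formats a computed value as the string that would be handed to iText.
    public static string Format(decimal value) =>
        value.ToString("N2", CultureInfo.InvariantCulture);

    // Sums the subtotals of all line items on one page.
    public static decimal PageTotal(IEnumerable<LineItem> items) =>
        items.Sum(i => i.Subtotal);
}
```

The formatted strings are then what you pass to whatever iText element draws the subtotal on each page.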

Related

iText PDF PArser does not parse the data as a whole word with octet-stream

I'm trying to parse a PDF file using iTextSharp (version 5.5.1.0). The PDF file has the content type "application/octet-stream". I'm using C# code that reads the text based on the location strategy:
base.RenderText(renderInfo);
// Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
// Create a rectangle from it
var rect = new Rectangle(
    bottomLeft[Vector.I1],
    bottomLeft[Vector.I2],
    topRight[Vector.I1],
    topRight[Vector.I2]);
var word = renderInfo.GetText().Trim();
// Get the column number
var position = (int)rect.Left;
Issue: When I read RenderInfo.GetText(), I get incomplete words: instead of "Daily" I get "Dai" and then "ly" in the next loop. Is there any way I can read complete words, word by word?
Please let me know if you need more info, unfortunately there is no option to attach the pdf file here.
Regards
Pradeep Jain
When I read RenderInfo.GetText(), I get incomplete words: instead of "Daily" I get "Dai" and then "ly" in the next loop.
That behavior is expected.
In a render listener / text extraction strategy you get the individual atomic string parameters of text drawing instructions. There is no requirement for PDF creation software to put whole words into these strings.
Actually, the PDF format even encourages this splitting of words. It does not by itself use the kerning information of fonts; thus, any software that wants to create text output with kerning has to split strings wherever kerning comes into play and slightly move the text insertion point between the string parts in text drawing instructions.
Thus, a render listener has to collect the strings and glue them together before it can expect to get whole words.
Is there any way I can read complete words, word by word?
Yes, by collecting the strings and gluing them together.
You mentioned you read based on Location Strategy - then look closer at what the LocationTextExtractionStrategy itself does: In its RenderText implementation it collects the text pieces with some coordinates, and only after collecting all those pieces, it sorts them and glues them together in its GetResultantText method. (You can find the code here.)
Unfortunately, many members of that strategy are not directly accessible in derived classes, so you may have to resort to reflection or simply copy the whole class and change it in place.
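To illustrate the collect-sort-glue idea independently of iTextSharp, here is a minimal sketch in plain C#. TextPiece and ChunkGluer are hypothetical names; the coordinates stand in for what a render listener would record, and the gap threshold of 2 units is an assumed value that needs tuning per document. It is not the logic of LocationTextExtractionStrategy itself, only a simplified version of the same approach.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical chunk type: one string from a text drawing instruction
// plus the coordinates a render listener would record for it.
public sealed class TextPiece
{
    public string Text;
    public float StartX;   // x of the start of the baseline
    public float EndX;     // x of the end of the baseline
    public float Y;        // y of the baseline
}

public static class ChunkGluer
{
    // Joins pieces on the same baseline; a space is inserted only when the
    // horizontal gap exceeds spaceThreshold (an assumed, tunable value).
    public static List<string> GlueLines(IEnumerable<TextPiece> pieces, float spaceThreshold = 2f)
    {
        var lines = new List<string>();

        // Top of the page first (larger y), then left to right within a line.
        var groups = pieces
            .GroupBy(p => (int)Math.Round(p.Y))
            .OrderByDescending(g => g.Key);

        foreach (var group in groups)
        {
            var sb = new StringBuilder();
            TextPiece previous = null;
            foreach (var piece in group.OrderBy(p => p.StartX))
            {
                if (previous != null && piece.StartX - previous.EndX > spaceThreshold)
                    sb.Append(' ');
                sb.Append(piece.Text);
                previous = piece;
            }
            lines.Add(sb.ToString());
        }
        return lines;
    }
}
```

With pieces "Dai" and "ly" that touch each other and a third piece "Report" several units further right on the same baseline, this yields the single line "Daily Report"; splitting that line on spaces then gives whole words.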

Parsing Complex PDF document with C#

See attached K-1 Document. I have attempted to use numerous tweaks with iTextSharp library but haven't had success in loading data correctly.
Ideally I would like to parse out the document similar to how humans would read them, one textbox at a time, reading its contents.
var reader = new PdfReader(FILE, Encoding.ASCII.GetBytes(password));
string[] lines;
var strategy = new LocationTextExtractionStrategy();
string currentPageText = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
lines = currentPageText.Split(new string[] {"\r\n", "\n"}, StringSplitOptions.None);
I also tried playing with Annotation parsing but didn't have luck.
I'm a newbie and probably looking at wrong place. Can you help guide me in the right direction?
Thanks a lot.
You would like to parse out the document similar to how humans would read them, one textbox at a time, reading its contents. That means you first will have to try and automatically recognize those text boxes. Then you can extract text by these areas.
To recognize those text boxes automatically in your document, you have to extract the border lines enclosing the boxes. For this you will first have to find out how those border lines are created. They might be drawn using vector graphics as lines or rectangles, but they could also be part of a background bitmap image.
Unfortunately I don't have your IRS form at hand and so cannot analyze its internals. Let's assume the borders are created using vector graphics for now. Thus, you have to extract vector graphics.
To extract vector graphics with iText(Sharp), you make use of classes from the iText(Sharp) parser namespace by making them parse the document and feed the parsing events into a listener you create which collects the vector graphic operations:
You implement IExtRenderListener, in particular its ModifyPath and RenderPath methods, which are called respectively when additional path elements (e.g. lines or rectangles) are added to the current path and when the current path is rendered (stroked? filled?). Your implementation collects this information.
You parse your document into an instance of your listener, e.g. using PdfReaderContentParser.
You analyse the lines and rectangles found and derive the coordinates of the boxes they build.
You parse the same page in a LocationTextExtractionStrategy instance.
You retrieve the texts of the recognized text boxes by calling LocationTextExtractionStrategy.GetResultantText with a matching ITextChunkFilter argument for each box.
(Actually you can do the parsing into the instance of your listener and the LocationTextExtractionStrategy instance in one pass for a bit of optimization.)
All iText(Sharp) specific tasks are trivial, and the only other task, the analysis of the lines and rectangles found to derive the coordinates of the boxes, should be no big problem for a software developer proficient in C#.
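As a rough illustration of the non-iText part, the sketch below assumes the box rectangles have already been derived from the collected lines and that each text chunk comes with the start point of its baseline. Box, PositionedText and BoxAssigner are hypothetical names and not iText(Sharp) types; each chunk is assigned to the smallest box that contains its start point, so nested boxes resolve to the innermost one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types: a box derived from the collected border lines and
// a text chunk with the position reported during text extraction.
public struct Box
{
    public string Name;
    public float Left, Bottom, Right, Top;

    public bool Contains(float x, float y) =>
        x >= Left && x <= Right && y >= Bottom && y <= Top;

    public float Area => (Right - Left) * (Top - Bottom);
}

public sealed class PositionedText
{
    public string Text;
    public float X, Y;   // start of the baseline, in PDF user space
}

public static class BoxAssigner
{
    // Assigns each chunk to the smallest box containing its start point.
    // Chunks outside every box are ignored.
    public static Dictionary<string, List<string>> Assign(
        IEnumerable<Box> boxes, IEnumerable<PositionedText> chunks)
    {
        var boxList = boxes.ToList();
        var result = boxList.ToDictionary(b => b.Name, b => new List<string>());
        foreach (var chunk in chunks)
        {
            var hits = boxList.Where(b => b.Contains(chunk.X, chunk.Y))
                              .OrderBy(b => b.Area)
                              .ToList();
            if (hits.Count > 0)
                result[hits[0].Name].Add(chunk.Text);
        }
        return result;
    }
}
```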
The first question is whether this form is electronic or scanned. The latter would make the data extraction much harder, as it would also involve OCR.
If you have an electronic PDF and all the forms are similar, then why not use the following strategy:
store the coordinates of each "box" in a config file
process the documents and extract text from every "box" (i.e. region)
additionally process the extracted text with regular expressions to separate the name from the address (or set the region to read text line by line)
If you have a few variations of the form, you can check the very first box to extract the name of the form and load the appropriate settings file (which contains a set of regions for that variation).
This approach should work with any PDF library.
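For the config file, any simple format will do. The sketch below assumes a made-up line format, "name;left;bottom;right;top", with '#' for comments; the format, the class name RegionConfig and the region name in the usage note are all assumptions for illustration, not something a PDF library prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class RegionConfig
{
    // Parses lines of the assumed form "name;left;bottom;right;top".
    // Blank lines and lines starting with '#' are skipped.
    public static Dictionary<string, float[]> Parse(IEnumerable<string> lines)
    {
        var regions = new Dictionary<string, float[]>();
        foreach (var raw in lines)
        {
            var line = raw.Trim();
            if (line.Length == 0 || line.StartsWith("#")) continue;

            var parts = line.Split(';');
            if (parts.Length != 5)
                throw new FormatException("Expected 5 fields: " + line);

            // Invariant culture keeps "12.5" readable on any machine.
            var coords = new float[4];
            for (int i = 0; i < 4; i++)
                coords[i] = float.Parse(parts[i + 1], CultureInfo.InvariantCulture);
            regions[parts[0].Trim()] = coords;
        }
        return regions;
    }
}
```

A settings file per form variation could then contain lines such as "PartnerName;36;600;300;640", and the resulting rectangles are what you hand to the region-based text extraction of your PDF library.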
Take a look at IvyPdf library and template editor. It's using c# and provides high-level functions to parse and extract data so you don't have to deal with internals of PDF documents. You can build fairly complex scenarios using it.
I don't think it can read annotations though.

Looping Through Large KML Text Files

I am a novice at programming, working on a C# solution for a geomorphology project. I need to extract coordinates from a variable number of Google Earth KML ground overlay files, converted to one long text string, and enter them into an array that can be accessed by other methods.
The KML tags and data of interest look like this:
<LatLonBox>
<north>37.91904192681665</north>
<south>37.46543388598137</south>
<east>15.35832653742206</east>
<west>14.60128369746704</west>
<rotation>-0.1556640799496235</rotation>
</LatLonBox>
The text files I will be processing with the program could have between 1 and 100 or more of these data groups, each embedded within the standard KML file headers/footers and other tags extraneous to my work. I have already developed the method for extracting the coordinate values as strings and have tested it for one KML file.
At this point it seems that the most efficient approach would be to construct some kind of looping method to search through the string for a coordinate data group, extract the data to a row in the array, then continue to the next group. The method might also go through the string and extract all the "north" data to a column in the array first, then loop back for all the "south" data, etc. I am open to any suggestions.
Due to my limited programming background, straight-forward solutions would be preferred over elegant or advanced solutions, but give it your best shot.
Thanks for your help.
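One possible sketch of the row-per-group loop described above, using the XML parser from the .NET base class library rather than manual string searching. It assumes each KML file is a well-formed XML document that is parsed on its own (several files concatenated into one string would not be valid XML), and the class name KmlBoxes and the column order are choices made here for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class KmlBoxes
{
    private static readonly string[] Fields = { "north", "south", "east", "west", "rotation" };

    // Returns one row per <LatLonBox>: north, south, east, west, rotation.
    public static List<double[]> Extract(string kml)
    {
        var rows = new List<double[]>();
        XDocument doc = XDocument.Parse(kml);

        // Match on the local name so the KML namespace does not matter.
        foreach (XElement box in doc.Descendants().Where(e => e.Name.LocalName == "LatLonBox"))
        {
            var row = new double[Fields.Length];
            for (int i = 0; i < Fields.Length; i++)
            {
                XElement value = box.Elements().FirstOrDefault(e => e.Name.LocalName == Fields[i]);
                // A missing element (e.g. no <rotation>) is stored as 0.
                row[i] = value == null
                    ? 0.0
                    : double.Parse(value.Value, CultureInfo.InvariantCulture);
            }
            rows.Add(row);
        }
        return rows;
    }
}
```

Calling Extract once per file and adding the returned rows to one shared list gives a table that other methods can read; rows[k][0] is then the "north" value of the k-th overlay.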

Get text position in Microsoft Word from VBA or C# Interop

I want to access the position and size for each indivisible unit in Microsoft Word. Examples of such units include individual characters, images, etc.
The purpose is to apply a visual overlay based on unit position and size. I will have no knowledge of the content in target documents.
Imagine the text of this question in a word document. I need to be able to iterate each character including white-space and carriage returns and get the size and position.
EDIT
It doesn't matter whether your answer considers macros, interop, add-ins or OLE embedding.
The method which retrieves displayed coordinates of an object is Window.GetPoint (link for the office interop version, same thing in VBA).
As for the "indivisible unit," you can put any meaning you want into that, using the available collections.
For instance, if you want it to be characters, you can use Document.Range.Characters, which is a collection of characters, each of which is a Range.
Or you can use Document.Range.InlineShapes for the pictures that are part of text.
Or Document.Range.ShapeRange to enumerate "floating" shapes.
At which point you might be thinking about Window.RangeFromPoint to figure an object from its window coordinates.

How to read a text file into a List in C#

I have a text file that has the following format:
1234
ABC123 1000 2000
The first integer value is a weight and the next line has three values, a product code, weight and cost, and this line can be repeated any number of times. There is a space in between each value.
I have been able to read in the text file, store the first value on the first line in a variable, and then read the subsequent lines into an array and then into a list, using ReadLine().Split(' ').
To me this seems an inefficient way of doing it, and I have been trying to find a way where I can read from the second line where the product codes, weights and costs are listed down into a list without the need of using an array. My list control contains an object where I am only storing the weight and cost, not the product code.
Does anyone know how to read in a text file, take in some values from the file straight into a list control?
Thanks
What you do is correct. There is no generalized way of doing it: what you described is the algorithm itself, and it has to be coded or parametrized somehow.
Since your text file isn't as structured as a CSV file, this kind of manual parsing is probably your best bet.
C# doesn't have a Scanner class like Java, so what you want doesn't exist in the BCL, though you could write your own.
The other answers are correct - there's no generalized solution for this.
If you've got a relatively small file, you can use File.ReadAllLines(), which will at least get rid of a lot of cruft, since it immediately converts the file to a string array for you.
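A minimal sketch of that, assuming the format from the question (a single weight on the first line, then "code weight cost" lines) and assuming all numbers are integers. ProductEntry and ProductFile are hypothetical names; the splitting into a temporary array still happens, it is just hidden inside the LINQ pipeline.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// Hypothetical item type holding only the two values the question keeps.
public sealed class ProductEntry
{
    public int Weight { get; set; }
    public int Cost { get; set; }
}

public static class ProductFile
{
    // First line: a single weight. Each later line: "code weight cost".
    // Throws if there are no lines or a line is malformed.
    public static List<ProductEntry> Parse(IEnumerable<string> lines, out int headerWeight)
    {
        var all = lines.Where(l => !string.IsNullOrWhiteSpace(l)).ToList();
        headerWeight = int.Parse(all[0].Trim(), CultureInfo.InvariantCulture);

        return all.Skip(1)
                  .Select(l => l.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                  .Select(p => new ProductEntry
                  {
                      Weight = int.Parse(p[1], CultureInfo.InvariantCulture),
                      Cost = int.Parse(p[2], CultureInfo.InvariantCulture)
                  })
                  .ToList();
    }

    // Convenience wrapper that reads the whole file first.
    public static List<ProductEntry> Load(string path, out int headerWeight) =>
        Parse(File.ReadAllLines(path), out headerWeight);
}
```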
If you don't want to parse strings from the file and reserve additional memory for holding the split strings, you can store your information in the file in a binary format. Then you can use the BinaryReader class with methods like ReadInt32(), ReadDouble() and others. It is more efficient than reading character by character.
One caveat: a binary format is hard for humans to read, so it will be difficult to edit the file in a text editor. Programmatically, though, it poses no problem.
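A small round-trip sketch of the binary variant. The layout (header weight, entry count, then weight/cost pairs, all as 32-bit integers) and the class name BinaryProductFile are assumptions made for this example; any layout works as long as the reader and the writer agree on it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class BinaryProductFile
{
    // Assumed layout: header weight, entry count, then weight/cost pairs.
    public static void Write(Stream output, int headerWeight, IList<int[]> entries)
    {
        using (var writer = new BinaryWriter(output, Encoding.UTF8, leaveOpen: true))
        {
            writer.Write(headerWeight);
            writer.Write(entries.Count);
            foreach (int[] e in entries)
            {
                writer.Write(e[0]); // weight
                writer.Write(e[1]); // cost
            }
        }
    }

    // Reads the values back in exactly the order they were written.
    public static List<int[]> Read(Stream input, out int headerWeight)
    {
        using (var reader = new BinaryReader(input, Encoding.UTF8, leaveOpen: true))
        {
            headerWeight = reader.ReadInt32();
            int count = reader.ReadInt32();
            var entries = new List<int[]>(count);
            for (int i = 0; i < count; i++)
                entries.Add(new[] { reader.ReadInt32(), reader.ReadInt32() });
            return entries;
        }
    }
}
```

For a real file, pass File.Create(path) to Write and File.OpenRead(path) to Read.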
