iTextSharp library does not extract text from my file - c#

iTextSharp library (version 5.5.5) does not extract text from my file.
I can copy and paste text from pdf into Notepad.
I uploaded the file to this link.
The source code is very simple and it works for other pdf files, but for this problematic file all I get is some characters without any meaning.
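// Required namespaces:
// using System.IO;
// using iTextSharp.text.pdf;
// using iTextSharp.text.pdf.parser;

var text = string.Empty;
// File.OpenRead is a static method, so it is called without "new".
using (var file = File.OpenRead(path))
using (var reader = new PdfReader(file))
{
    // Page numbers in iTextSharp are 1-based.
    for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
    {
        text += PdfTextExtractor.GetTextFromPage(reader, pageNumber);
    }
}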
Any help is highly appreciated.

The PDF declarations of the Asian fonts in your sample PDF do not contain a ToUnicode map to allow mapping from character codes to Unicode.
Furthermore, their encoding is Identity-H which is kind of a pseudo-encoding as it merely maps 2-byte character codes ranging from 0 to 65,535 to the same 2-byte CID value, so this still doesn't define a fixed encoding usable for text extraction.
Identity-H may actually only be used with CIDFonts using any Registry, Ordering, and Supplement values, and these ROS values convey the actual encoding information from which a mapping to Unicode can be derived. This is the case in your file.
To make use of these ROS values during text extraction, iText needs a set of resource files defining the mappings for the different predefined ROS values. As these files are quite large, they are not part of the standard iText main distribution jar/dll but have to be added to the class path as a separate jar/dll file.
I only tested this using the Java version of iText as I am more proficient with it.
iText 5.x/Java
The Maven coordinates for the 5.x version of this jar artifact:
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext-asian</artifactId>
    <version>5.2.0</version>
</dependency>
(As nothing has changed in these resources in the course of the recent years, there have been no 5.x releases since 5.2.0.)
After I added that jar to the classpath here, I could successfully extract Asian characters from your PDF. Whether they are 100% correct, I cannot say as I cannot read them.
iTextSharp 5.x/.Net
There should be a similar iTextSharp DLL with Asian font resources. (I found the iText 7 variant thereof but I am not sure that that works with a 5.x iTextSharp.)
Googling around, one finds a number of iTextAsian-*, iTextAsianCmaps-*, and iTextAsian-all-* files... I don't know, though, which of them work with the current iTextSharp 5.5.12.
As the OP found out, one additionally has to register the DLLs for iTextSharp (in contrast to iText / Java):
Here is how to notify iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:
static PdfDocument()
{
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}

I have an addition to the answer given by @mkl. Here is how to notify iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:
static PdfDocument()
{
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}

Related

Visual Studio extension: get language / file type from file name or ProjectItem

I'm writing an extension to provide basic project statistics (e.g. lines of code). It's simple enough to iterate a Solution tree and find the ProjectItems that correspond with files.
The Document structure has Kind and Language properties, but the latter is marked for internal use only, and both require the file to be opened in the editor first.
So... is there a way to:
See what files Visual Studio will classify as text files.
See what language Visual Studio associates with a given file name / file extension.
without opening the file?
I have written such statistics (although only for C# and VB.NET), and both questions are very tricky to answer for all project types / file types / languages. First of all, if you need to, you can open an EnvDTE.Document / EnvDTE.TextDocument from an EnvDTE.ProjectItem using the ProjectItem.Open(view) method, which returns an EnvDTE.Window. That window is invisible by default; you would need to set Window.Visible = true to make it visible. When done, you close the (invisible) window with Window.Close, unless it was already open (you can find out by calling ProjectItem.get_IsOpen(view) first and then closing the window or not accordingly).
Now:
It is very difficult to know whether a file is text or not because VS supports many project types, and each project type can treat its files/extensions as text files or not. The best approach that I found is to consider all files text files except those with known non-text extensions (.jpg, etc.). Also, notice that not all text files are code files (e.g. .txt files). For some features, such as a find-text feature, you may be interested in text files, but for a statistics feature you may be interested in code files, not just text files.
You can get the GUID of the language of a file using EnvDTE.ProjectItem.FileCodeModel.Language (and EnvDTE.Project.CodeModel.Language). Alas, some projects / files have a language but do not provide a code model, so you may need to map known extensions to a language.
Some useful language guids:
const string LANGUAGE_CSHARP = "{B5E9BD34-6D3E-4B5D-925E-8A43B79820B4}";
const string LANGUAGE_IDL = "{B5E9BD35-6D3E-4B5D-925E-8A43B79820B4}";
const string LANGUAGE_MANAGED_C = "{B5E9BD36-6D3E-4B5D-925E-8A43B79820B4}";
const string LANGUAGE_VBNET = "{B5E9BD33-6D3E-4B5D-925E-8A43B79820B4}";
const string LANGUAGE_VISUAL_C = "{B5E9BD32-6D3E-4B5D-925E-8A43B79820B4}";
const string LANGUAGE_PYTHON = "{888888A0-9F3D-457C-B088-3A5042F75D52}";
const string LANGUAGE_FSHARP = "{F2A71F9B-5D33-465A-A702-920D77279786}";
const string LANGUAGE_R = "{DA7A21FA-8162-4350-AD77-A8D1B671F3ED}";
Note that because VS is so extensible, there is no enum for languages; new languages come with new GUIDs.
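The extension-based fallback described above could be sketched roughly as follows. This is only an illustration that works on file names without opening anything in the editor: the class name FileLanguageGuesser, its method names, and the particular extension lists are assumptions made for this example, not part of the Visual Studio API, and the lists are far from complete.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FileLanguageGuesser
{
    // Extensions treated as non-text. Illustrative only, not exhaustive.
    private static readonly HashSet<string> BinaryExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".jpg", ".png", ".gif", ".ico", ".dll", ".exe", ".pdb", ".zip"
        };

    // Fallback mapping from extension to language GUID (GUIDs taken from the list above).
    private static readonly Dictionary<string, string> LanguageByExtension =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".cs", "{B5E9BD34-6D3E-4B5D-925E-8A43B79820B4}" },
            { ".vb", "{B5E9BD33-6D3E-4B5D-925E-8A43B79820B4}" },
            { ".fs", "{F2A71F9B-5D33-465A-A702-920D77279786}" },
            { ".py", "{888888A0-9F3D-457C-B088-3A5042F75D52}" }
        };

    // Every file counts as text unless its extension is known to be binary.
    public static bool IsProbablyText(string fileName)
    {
        return !BinaryExtensions.Contains(Path.GetExtension(fileName));
    }

    // Returns the language GUID for a known extension, or null if unknown.
    public static string GuessLanguageGuid(string fileName)
    {
        string guid;
        return LanguageByExtension.TryGetValue(Path.GetExtension(fileName), out guid)
            ? guid
            : null;
    }
}
```

A statistics feature would typically try FileCodeModel.Language first and fall back to GuessLanguageGuid only when no code model is available.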

iText7 Dynamically Set Field Calculation At Run-time

Working with iText7 library version 7.0.2.2 in a c# web application. A PDF document is produced with n-number of dynamically created pages based on the amount of data.
Is there a way to set a field with a calculated formula at run time? So for example, something along the lines of having a subtotal field calculation like
the product of Page1.Lineitem1.qty and Page1.LineItem1.unitprice.
iText is merely creating the view; the data model feeding it is provided by you.
Furthermore, iText only draws text strings, not numerical types, so it would have to parse those strings back into numbers, which can be difficult considering all the ways numbers can be formatted with commas, periods, plus and minus signs, brackets, units, ...
And the text pieces you draw with iText are not named.
And iText flushes content to the output as soon as possible to save memory.
...
So no, iText does not provide support for "the product of Page1.Lineitem1.qty and Page1.LineItem1.unitprice" or similar expressions.
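Since the data model is yours, one way to read this answer is to do the arithmetic in the model before anything is drawn and hand iText only the finished strings. A minimal sketch, assuming hypothetical LineItem and InvoiceMath types that are not part of iText:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical data model; nothing here depends on iText.
public sealed class LineItem
{
    public LineItem(int quantity, decimal unitPrice)
    {
        Quantity = quantity;
        UnitPrice = unitPrice;
    }

    public int Quantity { get; }
    public decimal UnitPrice { get; }

    // The "calculated field" is computed from numbers, never parsed back from text.
    public decimal Subtotal => Quantity * UnitPrice;
}

public static class InvoiceMath
{
    // Sum of all line subtotals.
    public static decimal Total(IEnumerable<LineItem> items)
    {
        return items.Sum(i => i.Subtotal);
    }

    // Formats an amount only at the last moment, just before it is drawn.
    public static string Format(decimal amount)
    {
        return amount.ToString("N2", CultureInfo.InvariantCulture);
    }
}
```

The strings returned by Format are what would be passed to the PDF-producing code; the numeric values never have to be recovered from the rendered page.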

iText PDF Parser does not parse the data as a whole word with octet-stream

I'm trying to parse a pdf file using itextsharp (version: 5.5.1.0). The pdf file has content-type as "application/octet-stream". I'm using C# code to read based on Location Strategy
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new Rectangle(
    bottomLeft[Vector.I1],
    bottomLeft[Vector.I2],
    topRight[Vector.I1],
    topRight[Vector.I2]);
var word = renderInfo.GetText().Trim();
// get column no
var position = (int)rect.Left;
Issue: When I read it with RenderInfo.GetText(), I get incomplete words; for example, instead of "Daily" I get "Dai" and then "ly" in the next loop. Is there any way I can read complete words one by one?
Please let me know if you need more info, unfortunately there is no option to attach the pdf file here.
Regards
Pradeep Jain
When I read it RenderInfo.GetText() I get incomplete words like instead of "Daily" I get "Dai" and "ly" in next loop.
That behavior is expected.
In a render listener / text extraction strategy you get the individual atomic string parameters of text drawing instructions. There is no requirement for PDF creation software to put whole words into these strings.
Actually the PDF format even encourages this splitting of words! It does not by itself use the kerning information of fonts; thus, any software that wants to create text output with kerning has to split strings wherever kerning comes into play and slightly move the text insertion point between the string parts in text drawing instructions.
Thus, a render listener has to collect the strings and glue them together before it can expect to get whole words.
Is there any way I can read complete word by word ?
Yes, by collecting the strings and gluing them together.
You mentioned you read based on Location Strategy - then look closer at what the LocationTextExtractionStrategy itself does: In its RenderText implementation it collects the text pieces with some coordinates, and only after collecting all those pieces, it sorts them and glues them together in its GetResultantText method. (You can find the code here.)
Unfortunately many members of that strategy are not immediately accessible in derived classes, so you may have to resort to reflection or simply to copying the whole class code and changing it in place.
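The collect-sort-glue idea can be illustrated without any iTextSharp types. The sketch below is not the LocationTextExtractionStrategy code; TextPiece and PieceGluer are hypothetical names, the coordinates are assumed to have been recorded in RenderText, and the gap threshold of 2 units is an arbitrary assumption that would need tuning per document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical record of one string received in a RenderText call.
public sealed class TextPiece
{
    public TextPiece(string text, float startX, float endX, float baselineY)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
        BaselineY = baselineY;
    }

    public string Text { get; }
    public float StartX { get; }
    public float EndX { get; }
    public float BaselineY { get; }
}

public static class PieceGluer
{
    // Pieces whose rounded baseline matches are treated as one line (a simplification).
    private static int LineKey(TextPiece p)
    {
        return (int)Math.Round(p.BaselineY);
    }

    // Sorts pieces top to bottom and left to right, then concatenates them.
    // A space is inserted only when the horizontal gap exceeds spaceGap.
    public static List<string> GlueLines(IEnumerable<TextPiece> pieces, float spaceGap = 2f)
    {
        var lines = new List<string>();
        var current = new StringBuilder();
        TextPiece previous = null;

        foreach (var piece in pieces.OrderByDescending(LineKey).ThenBy(p => p.StartX))
        {
            bool sameLine = previous != null && LineKey(piece) == LineKey(previous);
            if (previous != null && !sameLine)
            {
                // Baseline changed: finish the current line and start a new one.
                lines.Add(current.ToString());
                current.Clear();
            }
            else if (sameLine && piece.StartX - previous.EndX > spaceGap)
            {
                current.Append(' ');
            }
            current.Append(piece.Text);
            previous = piece;
        }

        if (previous != null) lines.Add(current.ToString());
        return lines;
    }
}
```

With pieces "Dai" and "ly" that nearly touch on the same baseline, the two strings are joined into one word, while a piece that starts several units further right is separated by a space.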

How can I parse through a table in a pdf file?

I have a custom table with name, first name, place of birth and place of residence in a PDF file which I want to parse in C#. One of the simplest ways of doing it would be:
using (PdfLoadedDocument document = new PdfLoadedDocument("foobar"))
{
    for (var i = 0; i < document.Pages.Count; i++)
    {
        Console.WriteLine($"============ PAGE NO. {i+1} ============");
        Console.WriteLine(document.Pages[i].ExtractText());
    }
}
But the problem is the output:
============ PAGE NO. 38 ============
John L.SmithSan Francisco5400 Baden
There's no way I can separate this with a regex, so I need a way to parse each column of each row in order to get all the customer values separated. How can I parse a table in a PDF file with Syncfusion?
You will need a method that returns the coordinates of each character found in the PDF. Then you have some math to do (basically computing the distance between characters) in order to know whether a character is part of a word and where the word itself is located along the x-axis. It requires quite a lot of work and effort, and I didn't find such a method in the Syncfusion documentation.
I wrote a class which does what you want, but it is for Java projects:
PDFLayoutTextStripper (upon PDFBox)
The Syncfusion control extracts text from a PDF document based on the structure of the content present in the document. So, with the current implementation of the Syncfusion control, we cannot recognize the rows and columns present in a table of the PDF document.
Also, it is not possible to use the Syncfusion control to extract the text in the same order as the PDF document displays it, since the content in a PDF document follows a fixed layout.
But we can populate the table of the PDF document in Excel using Tabula (an open source library). I have modified Tabula Java (open source) to achieve layout-based text extraction from the PDF document based on your requirement.
Please find the sample for this implementation at the link below:
http://www.syncfusion.com/downloads/support/directtrac/171585/ze/TextExtractionSample649531336
Kindly ensure the following before executing the sample:
Install the Java Runtime Environment (JRE) from the link below.
http://www.oracle.com/technetwork/java/javase/downloads/
Restart your machine.
Execute the above sample.
Try this and check whether it meets your requirement.

Parsing Complex PDF document with C#

See attached K-1 Document. I have attempted to use numerous tweaks with iTextSharp library but haven't had success in loading data correctly.
Ideally I would like to parse out the document similar to how humans would read them, one textbox at a time, reading its contents.
var reader = new PdfReader(FILE, Encoding.ASCII.GetBytes(password));
string[] lines;
var strategy = new LocationTextExtractionStrategy();
string currentPageText = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
lines = currentPageText.Split(new string[] {"\r\n", "\n"}, StringSplitOptions.None);
I also tried playing with Annotation parsing but didn't have luck.
I'm a newbie and probably looking at wrong place. Can you help guide me in the right direction?
Thanks a lot.
You would like to parse out the document similar to how humans would read them, one textbox at a time, reading its contents. That means you first will have to try and automatically recognize those text boxes. Then you can extract text by these areas.
To recognize those text boxes automatically in your document, you have to extract the border lines enclosing the boxes. For this you will first have to find out how those border lines are created. They might be drawn using vector graphics as lines or rectangles, but they could also be part of a background bitmap image.
Unfortunately I don't have your IRS form at hand and so cannot analyze its internals. Let's assume the borders are created using vector graphics for now. Thus, you have to extract vector graphics.
To extract vector graphics with iText(Sharp), you make use of classes from the iText(Sharp) parser namespace by making them parse the document and feed the parsing events into a listener you create which collects the vector graphic operations:
You implement IExtRenderListener, in particular its ModifyPath and RenderPath methods, which are called respectively when additional path elements (e.g. lines or rectangles) are added to the current path and when the current path is rendered (stroked? filled?). Your implementation collects this information.
You parse your document into an instance of your listener, e.g. using PdfReaderContentParser.
You analyse the lines and rectangles found and derive the coordinates of the boxes they build.
You parse the same page in a LocationTextExtractionStrategy instance.
You retrieve the texts of the recognized text boxes by calling LocationTextExtractionStrategy.GetResultantText with a matching ITextChunkFilter argument for each box.
(Actually you can do the parsing into the instance of your listener and the LocationTextExtractionStrategy instance in one pass for a bit of optimization.)
All iText(Sharp) specific tasks are trivial, and the only other task, the analysis of the lines and rectangles found to derive the coordinates of the boxes, should be no big problem for a software developer proficient in C#.
The first question is whether this form is electronic or scanned. The latter would make the data extraction much harder, as it would also involve OCR.
If you have an electronic PDF and all the forms are similar, then why not use the following strategy:
store the coordinates of each "box" in a config file
process the documents and extract text from every "box" (i.e. region)
additionally process the extracted text with regular expressions to separate the name from the address (or just set the region so that text is read line by line)
If you have a few variations of the form, you can check the very first box to extract the name of the form and load the appropriate settings file (which contains the set of regions for that variation).
This approach should work with any PDF library.
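The region-based strategy can be sketched independently of any PDF library. In the sketch below, Region, PositionedText and RegionExtractor are hypothetical names; the positioned text is assumed to come from whatever library is used, and the regions would be loaded from the config file rather than hard-coded.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical config entry: a named box in page coordinates.
public sealed class Region
{
    public Region(string name, float left, float bottom, float right, float top)
    {
        Name = name; Left = left; Bottom = bottom; Right = right; Top = top;
    }

    public string Name { get; }
    public float Left { get; }
    public float Bottom { get; }
    public float Right { get; }
    public float Top { get; }

    public bool Contains(float x, float y)
    {
        return x >= Left && x <= Right && y >= Bottom && y <= Top;
    }
}

// Hypothetical piece of text with the position reported by the PDF library.
public sealed class PositionedText
{
    public PositionedText(string text, float x, float y)
    {
        Text = text; X = x; Y = y;
    }

    public string Text { get; }
    public float X { get; }
    public float Y { get; }
}

public static class RegionExtractor
{
    // Returns, per region name, the texts inside it in reading order
    // (top to bottom, then left to right), joined by spaces.
    public static Dictionary<string, string> Extract(
        IEnumerable<Region> regions, IEnumerable<PositionedText> chunks)
    {
        var chunkList = chunks.ToList();
        var result = new Dictionary<string, string>();
        foreach (var region in regions)
        {
            var inside = chunkList
                .Where(c => region.Contains(c.X, c.Y))
                .OrderByDescending(c => c.Y)
                .ThenBy(c => c.X)
                .Select(c => c.Text);
            result[region.Name] = string.Join(" ", inside);
        }
        return result;
    }
}
```

The per-region strings can then be post-processed with regular expressions, as described in the steps above.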
Take a look at IvyPdf library and template editor. It's using c# and provides high-level functions to parse and extract data so you don't have to deal with internals of PDF documents. You can build fairly complex scenarios using it.
I don't think it can read annotations though.
