Skip section of PDF when extracting text and Get image coordinates - c#

Currently, I can extract all the text chunks with their location data from a PDF. The problem is that the PDF contains images with text annotations which I do not want including in the extraction.
However, for whatever reason whenever I search the PDF for images, it only finds 1 of the images and usually throws the exception: The colour space is not supported. It's as if it doesn't recognise them as images?
I am not wishing to extract the images, just locate where they start and end in relation to the PDF so I can exempt the text that is on top of the images.
For example:
Where the numbers on the graph are unwanted and need to be removed from the extracted text.
Im just not sure how to:
A) Locate all the images and store the coordinates of where it starts and ends
B) Ignore the text that is on top of the images in the PDF document
(I am using iTextSharp to try and achieve this, but so far I am not having much luck)

I'm not exactly sure how iTextSharp works but the PostScript language reference or the PDF Reference manuals may be a good place to start figuring out what you need to know.
I just cracked open a PDF file in a text editor to check out the format because I haven't seen it in a while and then realized what the problem might be.
PDFs support "Images", and "Stream Objects" which can contain image data. Stream objects actually declare enough information that you can know where they begin and end and write something to manually ignore them.
A Stream Object Header looks like this:
<</Intent/RelativeColorimetric/Subtype/Image/Length 19678/Filter/DCTDecode/Name/X/Metadata 4314 0 R/BitsPerComponent 8/ColorSpace 5247 0 R/Width 290/Height 372/Type/XObject>>stream
It's entirely possible that your particular PDF has only one "Image" and then the rest of it is "Streams".
I suggest cracking it open to take a look. It would also be beneficial if you included some sample code with on the library you're using.
I also found by opening a PDF in a text editor this string /Type /Page which seems to create new pages, so you there's a chance you could count those to determine which page you're currently on.
The header at the top of the document I'm reviewing is %PDF-1.2 and the latest version is 1.7, so there may be some disparity here because of that.
Any chance you can share the PDF file you're working with?

Related

Get position of text within PDF using iTextSharp 4.1.6

I've been tasked with creating "Signature" fields within PDF files that we generate. Some of these files are actually various word docs that get pieced together (if they apply) that are then converted to PDF's. Because of this, the position of the "Signature" field could be anywhere (vertically) on the page. We currently have iTextSharp version 4.1.6 and have no intentions of upgrading. I have successfully used this library for many of our other forms when I know the exact position of the field prior to processing.
The route we are thinking of taking is placing some white text where we would want the Signature field located. I would parse the PDF searching for this text and then place a field there. I am by no means a PDF expert and have spent some time looking into the various tokens within the file hoping that there is something I can use but have come up with no answers.
Using this outdated version, is there any way I am able to come up with a solution to finding and placing a field or am I wasting my time?

Itextsharp : OnEndPage event not called

I am trying to add header in existing PDF file with help of PDFPageEventHelper. I am getting error document has no pages in some files. The code works perfectly for other files. While debugging I found that for some files OnEndPage method is not called which may cause to throw exception 'document has no pages'
Any idea why this method or event (OnEndPage) is not called ?
Maybe your PDF files do not contain any page information. PDF renders content into bounding boxes. All you do is to define a box and render stuff into it. Therefore you do not need any page information.
Out there - in da real world - exists a lot of crazy pages. Some declare a box so that (0,0) is in the middle of the box. Perfect for drawing functions, but some libraries fail, because they think only in pages starting with left/top corner as (0,0). And such boxes can be transformed multiple times inside a document.
Many PDF documents contain a lot of parts which break PDF rules. There are some tools on the markets, which will validate your files against the PDF references. A few try to fix them. A cheap workaround may be to read the PDF into libre office and save it again as PDF. This will fix only a small set of errors, but yours may be among them.
You have to read the failing documents in a text editor to find the reason. But it is a pain in the a*

itext create pdf based on existing one with changed content

I got quite complicated ready pdf file. It has got barcode and fancy looking table.
I have to create based on it application which will generate pdfs that will look the same but contain different records in the table and different barcode.
Is it possible to copy existing pdf and just change content of barcode and table ?
What would be the best approach to create the same looking pdf but with different content ?
Whank You very much for help
If the barcode and table are static I would open it in photoshop or illustrator delete everything I dont want, Then save it as a pdf again. Then follow this guide iText - add content to existing PDF file and use it as a template to put my custom content in.
If the table and bar code are dynamically generated (each one is different) and you need to crop out content on the fly I would pull some hacky crap and draw white squares over all the content I want gone. then proceed to use it as a template.
Just my 2 cents given the information provided.

How to generate an RTF document server side in c#

I have tried using the System.Windows.Documents.FlowDocument server side, but ran into a problem with images.
What I need to produce is a document with headings, section breaks, page breaks, images (with text wrapping around from the left or the right), tables and ideally some kind of table of contents.
I use c# and asp.net.
Is there a library that will do most of this?
RTF has been chosen because the document needs to be openable in older versions of word, be editable, and we can't run word on the server.
Thank-you
I used MigraDoc in the past, it is a free library. You can create PDFs or RTFs. Just Google it.
I have started using .net rtf writer.
It produces clean rtf, but doesn't do everything I need.
There is pretty good documentation for rtf here.
I am working some things out for my self. For example, I needed to be able to wrap text around an image. Whilst the rtf writer above enables you to add images to documents, it does so by putting the image in its own paragraph. What I need is a shape element.
In the rtf it ends up looking something like this (some of the numbers define the size and position of the image in twips):
{\shp{\*\shpinst\shpleft3801\shptop1\shpright8300\shpbottom4500\shpfhdr0\shpbxcolumn\shpbxignore\shpbypara\shpbyignore\shpwr2\shpwrk0\shpfblwtxt0\shpz0
{\sp
{\sn pib}
{\sv
{\pict\pngblip\pichgoal4499\picwgoal4499
-- image binary data goes here --
}}}
{\sp
{\sn fLine}
{\sv 0}}}}
I sometimes just save something in word and try and understand what it did (but word seems to add a lot of noise).

How do I explore a PDF to determine if an element is text?

I have a PDF and want to extract the text contained in it. I've tried a few different PDF libraries and they all return basically the same results. When extracting the text from a two page document with literally hundreds of words, only a dozen or so words from the header are returned.
Is there any way to tell if the text I'm after is actually text or a raster image of the text? I'm thinking something along the lines of Firebug's "Inspect Element" but at this point I'll take any solution that tells what I'm really looking at.
This project really doesn't justify attempting to use OCR. And, although a simple solution, using fields in the PDF is not an option since the generator of the file is a third party.
If Acrobat/Reader can select the text, then it Is Text.
Reasons your library might not be able to find the text in question:
Complex/bad fonts or encodings. Adobe can be very forgiving of garbage in, somehow managing to get Good Info out.
The text could be in an annotation rather than the page contents. It won't matter what program parses the content stream if you need to look in the annot array instead.
You didn't name a particular library, so it's possible that the library you're using doesn't look inside XObject Forms. That's unlikely in an even remotely mature API, but stranger things have happened.
If you can get away with copy/pasta from Reader, then just go that route.
Have you tried Amyuni PDF Creator .Net? It allows you to enumerate all components from a specified rectangular region of a page and inspect their type from a predefined types list. You could run a quick test using the trial version and the following code sample for text extraction:
// open a PDF file
axPDFCreactiveX1.Open(System.IO.Directory.GetCurrentDirectory()+"\\sampleBookmarks.pdf", "");
axPDFCreactiveX1.Refresh ();
String text = axPDFCreactiveX1.GetRawPageText (1);
MessageBox.Show (text);
Additionally, it provides Tesseract OCR integration in case you needed it.
Disclaimer: I am part of the development team of this product.
Check this site out. It may contain some helpful code snippets. http://www.codeproject.com/KB/cs/PDFToText.aspx

Categories

Resources