Determine PDF Orientation Using iTextSharp (C#)

I am using iTextSharp in a C# Windows app to manipulate scanned portrait PDF invoice files. After scanning the files I'd like to automatically check (estimate) the orientation of the text on the page (the user may have scanned it upside down).
Invoices come from a variety of suppliers so I can't search for standard text or an image.
I was thinking that if I could crop the PDF page in two (top and bottom) and create two new PDF files, I could then compare the two file sizes. The larger file would probably be the top of the page. I could then rotate the page if required (I know how to do this bit).
Thanks
Update - I have found a way to split the page in half, but unfortunately the two files that are created are the same size (even though there is more text and imagery in the top half):
private void TrimDocument()
{
    // Derived from http://www.namedquery.com/cropping-pdf-using-itextsharp
    PdfReader pdfReader = new PdfReader("C:/Docman/RawScans/PDFWeightedTop.pdf");
    Rectangle pageSize = pdfReader.GetPageSizeWithRotation(1);

    // Keep the top half of the page...
    PdfRectangle rect = new PdfRectangle(0, pageSize.Height / 2, pageSize.Width, pageSize.Height);
    FileStream output = new FileStream("C:/Docman/Matched/top.pdf", FileMode.Create);
    // ...or the bottom half:
    //PdfRectangle rect = new PdfRectangle(0, 0, pageSize.Width, pageSize.Height / 2);
    //FileStream output = new FileStream("C:/Docman/Matched/bottom.pdf", FileMode.Create);

    Document doc = new Document(PageSize.A4);
    // Make a copy of the document, clipping page 1 to the chosen half.
    // Note: cropping only changes the visible area; the copied page keeps its full content stream.
    PdfSmartCopy smartCopy = new PdfSmartCopy(doc, output);
    doc.Open();
    var page = pdfReader.GetPageN(1);
    page.Put(PdfName.CROPBOX, rect);
    page.Put(PdfName.MEDIABOX, rect);
    var copiedPage = smartCopy.GetImportedPage(pdfReader, 1);
    smartCopy.AddPage(copiedPage);
    doc.Close();
    pdfReader.Close();
}
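A note on why both halves come out the same size: changing the MEDIABOX/CROPBOX only changes what a viewer displays; PdfSmartCopy still copies the page's entire content stream (for a scan, one big image XObject), so both files contain everything. A more direct measure than file size is the ink density of the two halves of the scanned image itself. A minimal sketch, assuming the scan has already been pulled out of the PDF as a System.Drawing.Bitmap (e.g. via iTextSharp's PdfImageObject.GetDrawingImage() on the page's image XObject):

using System;
using System.Drawing;

static class InkDensity
{
    // Mean darkness (0 = white, 255 = black) of the rows [yStart, yEnd).
    // Samples every 4th pixel for speed; GetPixel is slow but simple.
    static double MeanDarkness(Bitmap bmp, int yStart, int yEnd)
    {
        double sum = 0;
        long samples = 0;
        for (int y = yStart; y < yEnd; y += 4)
            for (int x = 0; x < bmp.Width; x += 4)
            {
                Color c = bmp.GetPixel(x, y);
                sum += 255 - (c.R + c.G + c.B) / 3.0;
                samples++;
            }
        return sum / samples;
    }

    // True if the bottom half is "heavier" than the top, i.e. the page is
    // probably upside down (invoices usually have a dense header/logo at the top)
    public static bool LooksUpsideDown(Bitmap scan)
    {
        double top = MeanDarkness(scan, 0, scan.Height / 2);
        double bottom = MeanDarkness(scan, scan.Height / 2, scan.Height);
        return bottom > top;
    }
}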

Off the top of my head there are a few ways you could go about determining the document's orientation, each with its own pros and cons in efficiency, accuracy, and effort/cost.
1. Use an OCR package such as Tesseract or Cuneiform and scan the page in one orientation and then again rotated 180 degrees. Since OCR packages will only detect correctly oriented text, whichever orientation captures more text is the correct one. This method may not be the most efficient, but it would probably be the most accurate (see the sketch after this list). There are many other OCR packages; consult Wikipedia.
2. Expose the contents of the JPEG in the PDF document via the iTextSharp.text.Image.RawData property, cast it to monochrome, and then use various scoring functions to assess areas of greater ink density. You will need to experiment here, but the first thing that comes to mind is to detect the heading/logo in your invoice, since that will most likely be at the top and will have a greater density than the bottom. Another idea: maybe there is always a footer, bar code, or tracking number, and you could scan that portion of the page in either orientation; its presence could be used as a flag.
3. Use a pixel-difference technique: build a composite mask (image) of all documents you know to have the correct orientation, use that mask to perform a bitwise XOR with your unknown image, and again with the opposite orientation, and compare the sum of black pixels in each. The theory is that the unknown image lies in the domain of the known images: if it is oriented correctly it should have very few differences, but if oriented incorrectly it will have many.
4. If you have a known domain of invoices, you could detect a feature of each invoice which indicates its orientation, similar to how a vending machine detects the type of bill you insert.
5. Mechanical Turk :)
6. Some combination of the above.
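Here is a minimal sketch of option 1 using the Tesseract NuGet wrapper. The tessdata path, language, and page image are assumptions; you would first export the scanned page to an image file:

using System;
using System.Drawing;
using System.Drawing.Imaging;
using Tesseract;

static class OrientationCheck
{
    // OCR one image and return the mean recognition confidence (0..1)
    static float MeanConfidence(string imagePath)
    {
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(img))
            return page.GetMeanConfidence();
    }

    static void Main()
    {
        // OCR the page as scanned, then again rotated 180 degrees
        float asIs = MeanConfidence("page.png");
        using (var bmp = new Bitmap("page.png"))
        {
            bmp.RotateFlip(RotateFlipType.Rotate180FlipNone);
            bmp.Save("page-rotated.png", ImageFormat.Png);
        }
        float rotated = MeanConfidence("page-rotated.png");

        // Whichever orientation OCRs with higher confidence is likely correct
        Console.WriteLine(rotated > asIs ? "Upside down - rotate 180" : "Orientation OK");
    }
}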
Good Luck, let us know how you proceed!


PDF Table Structure

I have a PDF file with tabular structure but I am not able to store it in database as the PDF file is in Mangal font.
So two problems occur to me:
Extract table data from PDF
Text is in Marathi language
I have managed to do this for English with the following code:
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, i + 1, strategy);
text.Append(currentText);
// Converting UTF-8 to UTF-8 is a no-op, so this just decodes the raw page content bytes
string rawPdfContent = Encoding.UTF8.GetString(pdfReader.GetPageContent(i + 1));
This gives the tabular structure, but only for English text; I want to know how to do it for Marathi.
Funnily enough, requirement no. 1 is actually the hardest.
In order to understand why, you need to understand PDF a bit.
PDF is not a WYSIWYG format. If you open a PDF file in Notepad (or Notepad++), you'll see that it doesn't seem to contain any human-readable information.
In fact, PDF contains instructions that tell a viewer program (like Adobe Reader) how to render the PDF.
So instead of having an actual table in there (like you might expect in an HTML document), it will contain stuff like:
draw a line from .. to ..
go to position ..
draw the characters '123'
set the font to Helvetica bold
go to position ..
draw a line from .. to ..
draw the characters '456'
etc
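If you want to see those instructions for yourself, you can dump a page's raw content stream. A minimal sketch with iTextSharp 5 ("input.pdf" is a placeholder path):

using System;
using System.Text;
using iTextSharp.text.pdf;

PdfReader reader = new PdfReader("input.pdf");
byte[] content = reader.GetPageContent(1);
// Content streams may contain binary data, but the operators themselves are ASCII
Console.WriteLine(Encoding.ASCII.GetString(content));
reader.Close();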
See also How does TextRenderInfo work in iTextSharp?
In order to extract the table from the PDF, you need to do several things.
implement IEventListener (this is a class that you can attach to a Parser instance; the Parser will go over the entire page and notify all listeners of TextRenderInfo, ImageRenderInfo, and PathRenderInfo events); a minimal sketch of such a listener follows this list
watch out for PathRenderInfo events
build a data structure that tracks which paths are being drawn
as soon as you detect a cluster of lines at roughly 90° angles, you can assume a table is being drawn
determine the biggest bounding box that fits the cluster of lines (this is known as the convex hull problem, and one algorithm that solves it is the gift wrapping algorithm)
now you have a rectangle that tells you where (on the page) the table is located
you can now recursively apply the same logic within the table to determine rows and columns
you can also keep track of TextRenderInfo events and sort them into bins depending on the rectangles that fit each individual cell of the table
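A minimal sketch of the listener part, using iText 7 for C# (where IEventListener lives; in iTextSharp 5 the rough equivalent is IRenderListener/IExtRenderListener). The clustering and convex-hull logic is left out:

using System.Collections.Generic;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Collects every path drawn on a page; a table detector would cluster
// these into groups of roughly perpendicular lines
class PathCollector : IEventListener
{
    public readonly List<Subpath> Paths = new List<Subpath>();

    public void EventOccurred(IEventData data, EventType type)
    {
        if (type == EventType.RENDER_PATH)
        {
            var pathInfo = (PathRenderInfo)data;
            foreach (Subpath subpath in pathInfo.GetPath().GetSubpaths())
                Paths.Add(subpath);
        }
    }

    public ICollection<EventType> GetSupportedEvents()
    {
        return new[] { EventType.RENDER_PATH };
    }
}

// Usage: feed one page through the parser, then inspect collector.Paths
// var doc = new PdfDocument(new PdfReader("invoice.pdf"));
// var collector = new PathCollector();
// new PdfCanvasProcessor(collector).ProcessPageContent(doc.GetPage(1));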
This is a lot of work, and none of it is trivial. In fact, this is the kind of stuff people write PhD theses about.
iText has a good implementation of most of these algorithms in the form of the pdf2Data tool.
Code:
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, i + 1, strategy);
string rawPdfContent = Encoding.UTF8.GetString(pdfReader.GetPageContent(i + 1));
Then I identified the horizontal and vertical lines in the PDF. For lines, PDF uses either the re (rectangle) operator or the m (move-to) and l (line-to) operators.
Then I handled the Marathi text that I got from iTextSharp.
Then I merged both: for the desired location I extract the text using this code:
// Build the cell rectangle from two detected vertical lines; the 800 and 150
// offsets translate the line coordinates into page coordinates for this layout
Int64 width = Convert.ToInt64(linesVertical[5].StartPoint.X) - Convert.ToInt64(linesVertical[2].StartPoint.X);
Int64 height = Convert.ToInt64(linesVertical[2].EndPoint.Y) - Convert.ToInt64(linesVertical[2].StartPoint.Y);
System.util.RectangleJ rect = new System.util.RectangleJ(
    Convert.ToInt64(linesVertical[2].StartPoint.X),
    800 - Convert.ToInt64(linesVertical[2].EndPoint.Y) + 150,
    width, height);
// Restrict text extraction to that rectangle
RenderFilter[] renderFilter = new RenderFilter[1];
renderFilter[0] = new RegionTextRenderFilter(rect);
ITextExtractionStrategy textExtractionStrategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
Owner_Name = PdfTextExtractor.GetTextFromPage(reader, 1, textExtractionStrategy);

iText PDF parser does not parse the data as whole words with octet-stream

I'm trying to parse a PDF file using iTextSharp (version 5.5.1.0). The PDF file is served with content type "application/octet-stream". I'm using the following C# code, based on the location strategy:
public override void RenderText(TextRenderInfo renderInfo)
{
    base.RenderText(renderInfo);
    // Get the bounding box for the chunk of text
    var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
    var topRight = renderInfo.GetAscentLine().GetEndPoint();
    // Create a rectangle from it
    var rect = new Rectangle(
        bottomLeft[Vector.I1],
        bottomLeft[Vector.I2],
        topRight[Vector.I1],
        topRight[Vector.I2]);
    var word = renderInfo.GetText().Trim();
    // Get the column number from the x coordinate
    var position = (int)rect.Left;
}
Issue: when I read renderInfo.GetText() I get incomplete words: instead of "Daily" I get "Dai" and then "ly" in the next call. Is there any way I can read complete words?
Please let me know if you need more info; unfortunately there is no option to attach the PDF file here.
Regards
Pradeep Jain
When I read renderInfo.GetText() I get incomplete words: instead of "Daily" I get "Dai" and then "ly" in the next call.
That behavior is expected.
In a render listener / text extraction strategy you get the individual atomic string parameters of text drawing instructions. There is no requirement for PDF creation software to put whole words into these strings.
Actually, the PDF format even encourages this splitting of words! It does not by itself use the kerning information of fonts; thus, any software that wants to create text output with kerning has to split strings wherever kerning comes into play and slightly move the text insertion point between the string parts in text drawing instructions.
Thus, a render listener has to collect the strings and glue them together before it can expect to get whole words.
Is there any way I can read complete words?
Yes, by collecting the strings and gluing them together.
You mentioned you read based on the location strategy - then look closer at what the LocationTextExtractionStrategy itself does: in its RenderText implementation it collects the text pieces with some coordinates, and only after collecting all those pieces does it sort them and glue them together in its GetResultantText method. (You can find the code here.)
Unfortunately many members of that strategy are not immediately available in derived classes, so you may have to resort to reflection, or simply copy the whole class code and change it in situ.
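For illustration, here is a minimal sketch of that collect-then-glue idea as a standalone strategy (iTextSharp 5). The half-space-width break heuristic and the coarse sort are assumptions you would need to tune; LocationTextExtractionStrategy buckets chunks into lines first, which is more robust:

using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf.parser;

// Collects raw text chunks with their baselines, then joins chunks whose
// gap is small - gluing "Dai" + "ly" back together into "Daily"
class GluingStrategy : ITextExtractionStrategy
{
    class Chunk { public string Text; public Vector Start, End; public float SpaceWidth; }
    readonly List<Chunk> chunks = new List<Chunk>();

    public void RenderText(TextRenderInfo info)
    {
        LineSegment baseline = info.GetBaseline();
        chunks.Add(new Chunk {
            Text = info.GetText(),
            Start = baseline.GetStartPoint(),
            End = baseline.GetEndPoint(),
            SpaceWidth = info.GetSingleSpaceWidth()
        });
    }

    public string GetResultantText()
    {
        // Coarse sort: top-to-bottom, then left-to-right
        var sorted = chunks.OrderByDescending(c => c.Start[Vector.I2])
                           .ThenBy(c => c.Start[Vector.I1]).ToList();
        var sb = new StringBuilder();
        for (int i = 0; i < sorted.Count; i++)
        {
            if (i > 0)
            {
                // Insert a space only when the gap is wider than half a space
                float gap = sorted[i].Start[Vector.I1] - sorted[i - 1].End[Vector.I1];
                if (gap > sorted[i].SpaceWidth / 2f) sb.Append(' ');
            }
            sb.Append(sorted[i].Text);
        }
        return sb.ToString();
    }

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo info) { }
}

Usage is the same as for LocationTextExtractionStrategy: pass an instance to PdfTextExtractor.GetTextFromPage.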

Decreasing the impact on a PDF's file size using iTextSharp to highlight text

I was able to successfully use the following code to highlight text in an existing PDF:
private static void highlightDiff(PdfStamper stamper, Rectangle rectangle, int page)
{
    float[] quadPoints = { rectangle.Left, rectangle.Bottom, rectangle.Right, rectangle.Bottom,
                           rectangle.Left, rectangle.Top, rectangle.Right, rectangle.Top };
    PdfAnnotation highlight = PdfAnnotation.CreateMarkup(stamper.Writer, rectangle, null,
        PdfAnnotation.MARKUP_HIGHLIGHT, quadPoints);
    highlight.Color = BaseColor.RED;
    stamper.AddAnnotation(highlight, page);
}
The problem is that I'm highlighting one character at a time, and my guess is that a new layer is added every time I call this function, because the resulting file size is significantly larger after the program has completed running.
I tried adding the following line at the end of the function; maybe it's just me, but it seemed to speed up the time it takes the PDF to load when I view it, though the file size still remains exceedingly large.
stamper.FreeTextFlattening = true;
I may try to make my code more efficient and decrease the number of calls I make (if the characters I'm highlighting are next to each other, find the combined rectangle and make a single call), but I was wondering if there was another way around this. Thanks in advance!
Each time you execute highlightDiff you add a new highlight annotation to the PDF. Inside the PDF such an annotation is an object like this:
1 0 obj
<<
/Rect[204.68 705.11 211.2 716.11]
/Subtype/Highlight
/Contents()
/QuadPoints[204.68 716.11 211.2 716.11 204.68 705.11 211.2 705.11]
/C[1 0 0]
/P 2 0 R
>>
Furthermore there needs to be a reference to this object from the page dictionary (its /Annots array) plus an entry in the cross-reference table.
Thus, each such call makes the PDF grow by roughly 200 bytes. If you highlight many individual characters, the file will indeed grow considerably: at 200 bytes apiece, highlighting 10,000 characters one at a time adds about 2 MB.
I may try to make my code more efficient and decrease the number of calls I make (if the characters I'm highlighting are next to each other, find the combined rectangle and call) but was wondering if there was another way around this.
If you indeed want your highlighting to be done using highlight annotations, there is no way around this.
If, on the other hand, you would also accept highlighting rectangles drawn into the regular page content, you may see less file-size growth with that approach. Even then, though, combining neighboring rectangles first would reduce file size (and PDF viewer resource requirements) considerably.
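For the combining approach, a sketch of merging neighboring character rectangles into runs before annotating (iTextSharp 5; the tolerances are assumptions to tune for your font sizes):

using System;
using System.Collections.Generic;
using iTextSharp.text;

static class HighlightRuns
{
    // Merge horizontally adjacent character rectangles (assumed sorted
    // left-to-right along one baseline) into longer runs, so each run
    // needs only a single highlight annotation
    public static List<Rectangle> MergeAdjacent(List<Rectangle> rects, float maxGap = 1.5f)
    {
        var merged = new List<Rectangle>();
        Rectangle current = null;
        foreach (var r in rects)
        {
            bool sameLine = current != null && Math.Abs(r.Bottom - current.Bottom) < 0.5f;
            if (sameLine && r.Left - current.Right <= maxGap)
            {
                // Extend the current run to cover this character too
                current = new Rectangle(current.Left, current.Bottom, r.Right, current.Top);
            }
            else
            {
                if (current != null) merged.Add(current);
                current = r;
            }
        }
        if (current != null) merged.Add(current);
        return merged;
    }
}

Each merged run then gets one highlightDiff call instead of one call per character.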

Best way to get image dimensions using ImageResizer

I am switching an existing MVC 4 website from home-cooked user file uploads to resizing files with ImageResizer as they are uploaded.
I see in the documentation that I should not use System.Drawing, but I can't figure out any other way of grabbing the image dimensions.
It does not matter if the dimensions are from the original image or a resized image, since I am preserving aspect ratio and merely need to determine if an image is landscape or portrait.
I am adding the code here that I refer to in my comment responding to @Nathanael's answer.
ImageJob ij = new ImageJob(file, requestedImageInfo: null);
int? y = ij.SourceWidth;
int? z = ij.SourceHeight;
If you can store the image dimensions during upload (from ImageJob.SourceWidth/Height or LoadImageInfo), that is best, as reading image dimensions from a file involves lots of I/O.
If not, ImageResizer offers the IDictionary LoadImageInfo(object source, IEnumerable requestedInfo) method to do so after the fact. Just keep in mind, it does involve reading from disk, and you don't want to call this lots of times in a single HTTP request. Put those numbers in the database.
You can always calculate the final size of an image via ImageBuilder.GetFinalSize(originalSize, instructions). This, on the other hand, is very fast, as it involves no I/O, just math.
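A sketch of the after-the-fact route, where file is the upload from your snippet. The "source.width"/"source.height" keys are my assumption about what the info dictionary contains, so check what your ImageResizer version actually returns:

using System;
using System.Collections.Generic;
using ImageResizer;

// Read the dimensions once at upload time (this is the I/O-heavy part),
// then persist them; the dictionary keys below are an assumption
IDictionary<string, object> info =
    ImageBuilder.Current.LoadImageInfo(file, new[] { "source.width", "source.height" });
int width = Convert.ToInt32(info["source.width"]);
int height = Convert.ToInt32(info["source.height"]);
bool isLandscape = width > height;   // the aspect check itself needs no resizing at all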

iTextSharp reporting text position incorrectly

I'm working on a text extraction system for PDF files using iTextSharp. I have already created a class that implements ITextExtractionStrategy and implemented methods like RenderText(), GetResultantText(), etc. I have also studied the LocationTextExtractionStrategy class provided by iTextSharp itself.
The problem I'm facing is that for a particular PDF document, the RenderText() method reports the horizontal position of a few text chunks incorrectly. This happens for around 15-20 chunks out of a total of 700+ text chunks available on the page. I'm using the following simple code to get text position in RenderText():
// Record each chunk with its baseline start/end points and space width
LineSegment segment = renderInfo.GetBaseline();
TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
chunks.Add(location);
After collecting all the text chunks, I try to draw them on a bitmap using the Graphics class and the following simple loop:
for (int k = 0; k < chunks.Count; k++)
{
    var ch = chunks[k];
    // Flip the y axis: PDF's origin is bottom-left, GDI+'s is top-left
    g.DrawString(ch.text, fnt, Brushes.Black, ch.startLocation[Vector.I1],
        bmp.Height - ch.startLocation[Vector.I2], StringFormat.GenericTypographic);
}
The problem happens with the X (horizontal) dimension only, and only for these few text chunks: they appear slightly to the left of their actual position. I was wondering if there's something wrong with my code here.
Shujaat
Finally figured this out. In PDF, computing actual text positions is more complicated than simply getting the baseline coordinates. You need to incorporate character and word spacing, horizontal and vertical scaling, and some other factors too. I did some correspondence with the iText guys, and they have now incorporated a new method in the TextRenderInfo class that provides actual character-by-character positions, taking care of all of the above factors.
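For reference, the method in question appears to be TextRenderInfo.GetCharacterRenderInfos(), added in later iTextSharp 5.x releases. A minimal sketch of using it inside a strategy's RenderText (what you store is up to you; the storage step is just a comment here):

using iTextSharp.text.pdf.parser;

// Split each chunk into per-character render infos whose baselines already
// account for character/word spacing and horizontal scaling
public virtual void RenderText(TextRenderInfo renderInfo)
{
    foreach (TextRenderInfo glyph in renderInfo.GetCharacterRenderInfos())
    {
        Vector start = glyph.GetBaseline().GetStartPoint();
        string character = glyph.GetText();
        // store (character, start) instead of the whole chunk's baseline
    }
}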
