Read columns of PDF in C# using ItextSharp

Read columns of PDF in C# using ItextSharp - c#

In my progam I extracted text from a PDF file and it works well. ItextSharp extracts text from PDF line by line. However, when a PDF file contains 2 columns, the extracted text is not ok as in each line joins two columns.
My problem is: How can I extract text column by column?
Below is my code. PDF files are Arabic. I'm sorry my English is not so good.
PdfReader reader = new PdfReader(#"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;
string[] words;
string line;
for (int i = 1; i <= intPageNum; i++)
{
text = PdfTextExtractor.GetTextFromPage(reader, i,
new LocationTextExtractionStrategy());
words = text.Split('\n');
for (int j = 0, len = words.Length; j < len; j++)
{
line = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(words[j]));
// other things here
}
// other things here
}

You may want to use RegionTextRenderFilter to restrict a column region then use LocationTextExtractionStrategy to extract the text. However this requires prior knowledge to the PDF file your are parsing, i.e. you need information about the column's position and size.
In more details, you need to pass in the coordinates of your column to define a rectangle, then extract the text from that rectangle. A sample will be like this:
PdfReader reader = new PdfReader(#"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;
private string GetColumnText(float llx, float lly, float urx, float ury)
{
// reminder, parameters are in points, and 1 in = 2.54 cm = 72 points
var rect = new iTextSharp.text.Rectangle(llx, lly, urx, ury);
var renderFilter = new RenderFilter[1];
renderFilter[0] = new RegionTextRenderFilter(rect);
var textExtractionStrategy =
new FilteredTextRenderListener(new LocationTextExtractionStrategy(),
renderFilter);
var text = PdfTextExtractor.GetTextFromPage(reader, intPageNum,
textExtractionStrategy);
return text;
}
Here is another post discussing what you want, you may want to check as well: iTextSharp - Reading PDF with 2 columns. But they didn't hit the solution either :(

Related

How to split a PDF by titles with a specific size or font

I'm trying to split a PDF by his titles depending of his size or font.
Currently I could only extract the fonts of a PDF but I need to know the size or font to see if that text is a title. I don't know how to read the PDF by his titles with an specific size or font, I know it can extract the text but is just that, simple text, how do you know which size or font has that text ?
By the way, I have been able to split the PDF by his bookmarks and also the bookmark's "kids"
but I need to split the PDF more deeper. Is that why I'm trying to get the titles to split the PDF by them.
I made some research but I couldn't get something very useful for this case.
Here some questions:
How do you get the size?
How do you get the font?
How do you iterate the PDF line by line?
How do you check a text(line) if has a specific font?
Some code
HashSet<String> fontNames = new HashSet<string>();
PdfDictionary resources;
for (int p = 1; p <= reader.NumberOfPages; p++)
{
PdfDictionary dic = reader.GetPageN(p);
resources = dic.GetAsDict(PdfName.RESOURCES);
if (resources != null)
{
//get fonts dictionary
PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);
if (fonts != null)
{
PdfDictionary font;
foreach (PdfName key in fonts.Keys)
{
font = fonts.GetAsDict(key);
String name = font.GetAsName(PdfName.BASEFONT).ToString();
fontNames.Add(name);
}
}
}
}
Another way
List<object[]> fonts2 = BaseFont.GetDocumentFonts(reader);
Other code: get text
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
words = currentText.Split('\n');
for (int j = 0, len = words.Length; j < len; j++)
{
line = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(words[j]));
}

I used this:
public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
and the renderInfo parameter has all what I needed it.
renderInfo.GetFont().PostscriptFontName; // Font Name
renderInfo.GetBaseline().GetStartPoint(); // Coordinates - (56.6929 , 727.8466, 1)
renderInfo.GetAscentLine().GetEndPoint(); // Coordinates - (96.78018, 737.1749, 1)

ITextSharp add three points to end of word

I have a table with sensors data. There are sensors with long names that do not fit into a cell.
I want to add three points to end of the long sensor names (like text-overflow: ellipsis in css). I want to do it flexible without hardcoded values. Because in the future number of columns may be different.
How can I do it?
I create table like this:
var table = new PdfPTable(columns.Length);
var widths = new List<float>();
for (var i = 0; i < columns.Length; i++)
{
widths.Add(1f);
}
table.SetWidths(widths.ToArray());
And fill it:
for (var i = 0; i < columns.Length; i++)
{
var cell = new PdfPCell(new Phrase(columns[i], tableDataFont));
cell.UseAscender = true;
cell.HorizontalAlignment = Element.ALIGN_CENTER;
cell.VerticalAlignment = Element.ALIGN_MIDDLE;
cell.BackgroundColor = new Color(204, 204, 204);
cell.MinimumHeight = 20f;
table.AddCell(cell);
}

Something like this should work...
var text = "testtesttesttesttest";
var maxLength = 7;
var displayname = text;
if (text.Length > maxLength)
{
displayname = text.Substring(0, maxLength) + "...";
}
Considering you didn't post any code of your own, I can only provide some logic.
What I do here is just take the real value (insert that into text).
Declare a maximum length you want the text in your fields to be (maxLength).
Create a storage variable to for manipulation of the name without losing the original data (in case you want to keep that).
Then check if the text is longer than the maximumlength and replace all that is too long with "...".
You can then return "displayname" to whatever field you want it to be in.
This most likely isn't what you're looking for, but it does answer your question as it is right now.

Getting PDF page length

In my articles which formatted PDF, one or more pages may be blanked and I want to detect them and remove from PDF file. If I can identify pages that are less than 60 KB, I think I can detect the pages that are empty. Because they're probably empty.
I tried like this:
var reader = new PdfReader("D:\\_test\\file.pdf");
/*
* With reader.FileLength, I can get whole pdf file size.
* But I dont know, how can I get pages'sizes...
*/
for (var i = 1; i <= reader.NumberOfPages; i++)
{
/*
* MessageBox.Show(???);
*/
}

I would do this in 2 steps:
first go over the document using IEventListener to detect which pages are empty
once you've determined which pages are empty, simply create a new document by copying the non-empty pages from the source document into the new document
step 1:
List<Integer> emptyPages = new ArrayList<>();
PdfDocument pdfDocument = new PdfDocument(new PdfReader(new File(SRC)));
for(int i=1;i<pdfDocument.getNumberOfPages();i++){
IsEmptyEventListener l = new IsEmptyEventListener();
new PdfCanvasProcessor(l).processPageContent(pdfDocument.getPage(i));
if(l.isEmptyPage()){
emptyPages.add(i);
}
}
Then you need the proper implementation of IsEmptyEventListener. Which may be tricky and depend on your specific document(s). This is a demo.
class IsEmptyEventListener implements IEventListener {
private int eventCount = 0;
public void eventOccurred(IEventData data, EventType type){
// perhaps count only text rendering events?
eventCount++;
}
public boolean isEmptyPage(){ return eventCount < 32; }
}
step 2:
Based on this example: https://developers.itextpdf.com/examples/stamping-content-existing-pdfs/clone-reordering-pages
void copyNonBlankPages(List<Integer> blankPages, PdfDocument src, PdfDocument dst){
int N = src.getNumberOfPages();
List<Integer> toCopy = new ArrayList<>();
for(int i=1;i<N;i++){
if(!blankPages.contains(i)){
toCopy.add(i);
}
}
src.copyPagesTo(toCopy, dst);
}

c# itextsharp, locate words not chunks in page with their location for adding sticky notes

I already read all related StackOverflow and haven't find a decent solution to this. I want to open a PDF, get the text (words) and their coordinates then further, add a sticky note to some of them.
Seems to be mission impossible, I'm stucked.
How come this code will correctly find all words in a page (but not their coordinates)?
using (PdfReader reader = new PdfReader(path))
{
StringBuilder sb = new StringBuilder();
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
for (int page = 5; page <= 5; page++)
{
string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
Console.WriteLine(text);
}
//txt = sb.ToString();
}
But this one gets coordinates, but for "chunks" that cannot rely they are in proper order.
PdfReader reader = new PdfReader(path);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
LocationTextExtractionStrategyEx strategy;
for (int i = 5; i <= 5; i++) // reader.NumberOfPages
{
//strategy = parser.ProcessContent(i, new SimpleTextExtractionStrategy());
// new MyLocationTextExtractionStrategy("sample", System.Globalization.CompareOptions.None)
strategy = parser.ProcessContent(i, new LocationTextExtractionStrategyEx("MCU_MOSI", 0));
foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk chunk in strategy.m_DocChunks)
{
if (chunk.m_text.Trim() == "MCU_MOSI")
Console.WriteLine("Bingo"); // <-- NEVER HIT
}
//Console.WriteLine(strategy.m_SearchResultsList.ToString()); // strategy.GetResultantText() +
}
This uses a class from this post (little modified by me)
Getting Coordinates of string using ITextExtractionStrategy and LocationTextExtractionStrategy in Itextsharp
But only finds useless "chunks".
So the question is can with iTextSharp really locate words in page so I can add some sticky notes nearby? Thank you.

It looks like the chunk.m_text only contains one letter at a time which is why it this will never be true:
if (chunk.m_text.Trim() == "MCU_MOSI")
What you could do instead is have each chunk text added to a string and see if it contains your text.
PdfReader reader = new PdfReader(path);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
LocationTextExtractionStrategyEx strategy;
string str = string.Empty;
for (int i = 5; i <= 5; i++) // reader.NumberOfPages
{
strategy = parser.ProcessContent(i, new LocationTextExtractionStrategyEx("MCU_MOSI", 0));
var x = strategy.m_SearchResultsList;
foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk chunk in strategy.m_DocChunks)
{
str += chunk.m_text;
if (str.Contains("MCU_MOSI"))
{
str = string.Empty;
Vector location = chunk.m_endLocation;
Console.WriteLine("Bingo");
}
}
}
Note for the example of the location, I made m_endLocation public.

With iTextSharp, how to rightalign a container of text (but keeping the individual lines leftaligned)

I don't know the width of the texts in the textblock beforehand, and I want the the textblock to be aligned to the right like this, while still having the individual lines left-aligned:
Mr. Petersen |
Elmstreet 9 |
888 Fantastic City|
(| donotes the right edge of the document)
It should be simple, but I can't figure it out.
I've tried to put all the text in a paragraph and set paragraph.Alignment = Element.ALIGN_RIGHT, but this will rightalign the individual lines.
I've tried to put the paragraph in a cell inside a table, and rightalign the cell, but the cell just takes the full width of the table.
If I could just create a container that would take only the needed width, I could simply rightalign this container, so maybe that is really my question.

Just set the Paragraph.Align property:
using (Document document = new Document()) {
PdfWriter.GetInstance(
document, STREAM
);
document.Open();
for (int i = 1; i < 11; ++i) {
Paragraph p = new Paragraph(string.Format(
"Paragraph {0}", i
));
p.Alignment = Element.ALIGN_RIGHT;
document.Add(p);
}
}
It even works with a long string like this:
string longString = #"
iText ® is a library that allows you to create and manipulate PDF documents. It enables developers looking to enhance web- and other applications with dynamic PDF document generation and/or manipulation.
";
Paragraph pLong = new Paragraph(longString);
pLong.Alignment = Element.ALIGN_RIGHT;
document.Add(pLong);
EDIT:
After looking at the "picture" you drew...
It doesn't match with the title. The only way you can align individual Paragraph objects like your picture is if the "paragraph" does NOT exceed the Document object's "content" box (for a lack of a better term). In other words, you won't be able to get that type of alignment if the amount of text exceeds that which will fit on a single line.
With that said, if you want that type of alignment you need to:
Calculate the widest value from the collection of strings you intend to use.
Use that value to set a common left indentation value for the Paragraphs.
Something like this:
using (Document document = new Document()) {
PdfWriter.GetInstance(
document, STREAM
);
document.Open();
List<Chunk> chunks = new List<Chunk>();
float widest = 0f;
for (int i = 1; i < 5; ++i) {
Chunk c = new Chunk(string.Format(
"Paragraph {0}", Math.Pow(i, 24)
));
float w = c.GetWidthPoint();
if (w > widest) widest = w;
chunks.Add(c);
}
float indentation = document.PageSize.Width
- document.RightMargin
- document.LeftMargin
- widest
;
foreach (Chunk c in chunks) {
Paragraph p = new Paragraph(c);
p.IndentationLeft = indentation;
document.Add(p);
}
}
UPDATE 2:
After reading your updated question, here's another option that lets you add text to the left side of the "container":
string textBlock = #"
Mr. Petersen
Elmstreet 9
888 Fantastic City
".Trim();
// get the longest line to calcuate the container width
var widest = textBlock.Split(
new string[] {Environment.NewLine}
, StringSplitOptions.None
)
.Aggregate(
"", (x, y) => x.Length > y.Length ? x : y
)
;
// throw-away Chunk; used to set the width of the PdfPCell containing
// the aligned text block
float w = new Chunk(widest).GetWidthPoint();
PdfPTable t = new PdfPTable(2);
float pageWidth = document.PageSize.Width
- document.LeftMargin
- document.RightMargin
;
t.SetTotalWidth(new float[]{ pageWidth - w, w });
t.LockedWidth = true;
t.DefaultCell.Padding = 0;
// you can add text in the left PdfPCell if needed
t.AddCell("");
t.AddCell(textBlock);
document.Add(t);

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Read columns of PDF in C# using ItextSharp - c#

Related

How to split a PDF by titles with a specific size or font

ITextSharp add three points to end of word

Getting PDF page length

c# itextsharp, locate words not chunks in page with their location for adding sticky notes

With iTextSharp, how to rightalign a container of text (but keeping the individual lines leftaligned)

Categories

Resources