Get page number from Word document - c#

I'm using GemBox.Document and I need to find out which page my bookmark is located on inside the Word document. Can this be done?
If not, then can I find out the page on which some specific text is located?
I can find both bookmark and text, but I don't see any option that lets me get the page number from that.
DocumentModel document = DocumentModel.Load("My Document.docx");
Bookmark bookmark = document.Bookmarks["My Bookmark"];
ContentRange content = document.Content.Find("My Text").First();

This is a somewhat uncommon task for Word files because the files themselves have no concept of a page. They are flow documents, and pages exist only in the application that renders them (such as Microsoft Word).
Flow-document formats (DOC, DOCX, RTF, HTML, etc.) define content in a flowable manner, which makes editing easier.
Fixed-document formats (PDF, XPS, etc.), on the other hand, do have a page concept: they specify the page and the location at which each piece of content is rendered, so the document looks the same in any application and on any screen.
Nevertheless, here is how you can obtain the page number from some ContentPosition using GemBox.Document:
static int GetPageNumber(ContentPosition position)
{
    DocumentModel document = position.Parent.Document;

    // Insert a temporary PAGE field at the given position.
    Field pageField = new Field(document, FieldType.Page);
    Field importedPageField = position.InsertRange(pageField.Content).Parent as Field;

    // Paginate the document so that the field's content is updated.
    document.GetPaginator(new PaginatorOptions() { UpdateFields = true });
    int pageNumber = int.Parse(importedPageField.Content.ToString());

    // Remove the temporary field again.
    importedPageField.Content.Delete();
    return pageNumber;
}
Also, here is how you can use it:
DocumentModel document = DocumentModel.Load("My Document.docx");
Bookmark bookmark = document.Bookmarks["My Bookmark"];
ContentRange content = document.Content.Find("My Text").First();
int bookmarkPageNumber = GetPageNumber(bookmark.Start.Content.Start);
int contentPageNumber = GetPageNumber(content.Start);
Last, note that GetPaginator is a fairly heavy operation (it is basically similar to saving the whole document to PDF), so it can be expensive for a large document.
So, if you need to call GetPageNumber many times (for example, to find the page number of every bookmark), consider changing the code so that it first inserts all the page fields you need, then calls GetPaginator just once, and finally reads the content of all those page fields.
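The batching idea described above could look roughly like the sketch below. It only rearranges the calls already used in GetPageNumber; the class name, the method name GetBookmarkPageNumbers, and the choice to pass bookmark names are assumptions made for illustration, and the code has not been tested against a particular GemBox.Document version.

```csharp
using System.Collections.Generic;
using GemBox.Document;

static class PageNumbers
{
    // Hypothetical helper: returns the page number of each named bookmark
    // while paginating the document only once.
    public static Dictionary<string, int> GetBookmarkPageNumbers(
        DocumentModel document, IEnumerable<string> bookmarkNames)
    {
        var insertedFields = new Dictionary<string, Field>();

        // Step 1: insert a temporary PAGE field at the start of every bookmark.
        // Each position is looked up right before the insert.
        foreach (string name in bookmarkNames)
        {
            ContentPosition position = document.Bookmarks[name].Start.Content.Start;
            Field pageField = new Field(document, FieldType.Page);
            insertedFields[name] = position.InsertRange(pageField.Content).Parent as Field;
        }

        // Step 2: paginate once so that all PAGE fields are updated together.
        document.GetPaginator(new PaginatorOptions() { UpdateFields = true });

        // Step 3: read the content of each field, then remove the temporary field.
        var pageNumbers = new Dictionary<string, int>();
        foreach (KeyValuePair<string, Field> entry in insertedFields)
        {
            pageNumbers[entry.Key] = int.Parse(entry.Value.Content.ToString());
            entry.Value.Content.Delete();
        }

        return pageNumbers;
    }
}
```

With this shape the expensive GetPaginator call runs once, no matter how many bookmarks are queried.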

Related

iText PDF PArser does not parse the data as a whole word with octet-stream

I'm trying to parse a pdf file using itextsharp (version: 5.5.1.0). The pdf file has content-type as "application/octet-stream". I'm using C# code to read based on Location Strategy
base.RenderText(renderInfo);
// Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
// Create a rectangle from it
var rect = new Rectangle(
    bottomLeft[Vector.I1],
    bottomLeft[Vector.I2],
    topRight[Vector.I1],
    topRight[Vector.I2]);
var word = renderInfo.GetText().Trim();
// Get the column number
var position = (int)rect.Left;
Issue: When I read it with RenderInfo.GetText(), I get incomplete words; for example, instead of "Daily" I get "Dai" and then "ly" in the next loop iteration. Is there any way I can read complete words?
Please let me know if you need more info, unfortunately there is no option to attach the pdf file here.
Regards
Pradeep Jain
When I read it RenderInfo.GetText() I get incomplete words like instead of "Daily" I get "Dai" and "ly" in next loop.
That behavior is expected.
In a render listener / text extraction strategy you get the individual atomic string parameters of text drawing instructions. There is no requirement for PDF creation software to put whole words into these strings.
Actually the PDF format even encourages this splitting of words! It does not by itself use the kerning information of fonts; thus, any software that wants to create text output with kerning has to split strings wherever kerning comes into play and slightly move the text insertion point between the string parts in text drawing instructions.
Thus, a render listener has to collect the strings and glue them together before it can expect to get whole words.
Is there any way I can read complete words?
Yes, by collecting the strings and gluing them together.
You mentioned you read based on Location Strategy - then look closer at what the LocationTextExtractionStrategy itself does: In its RenderText implementation it collects the text pieces with some coordinates, and only after collecting all those pieces, it sorts them and glues them together in its GetResultantText method. (You can find the code here.)
Unfortunately, many members of that strategy are not directly accessible in derived classes, so you may have to resort to reflection or simply copy the whole class code and change it in place.
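To make the "collect, sort, glue" idea concrete, here is a minimal sketch that does not depend on iTextSharp. The TextChunk record, the ChunkJoiner class, and the gap threshold of 1.5 units are all assumptions made for illustration; this is not the code of LocationTextExtractionStrategy, only the same general technique in a simplified form.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical chunk type: the text of one drawing instruction plus the
// horizontal extent and the baseline height at which it was drawn.
public record TextChunk(string Text, float StartX, float EndX, float BaselineY);

public static class ChunkJoiner
{
    // Chunks whose baselines round to the same integer are treated as one line.
    private static int LineOf(TextChunk chunk) => (int)Math.Round(chunk.BaselineY);

    // Sorts the chunks into reading order and glues them together. A space is
    // inserted only when the gap to the previous chunk exceeds gapThreshold.
    public static string Join(IEnumerable<TextChunk> chunks, float gapThreshold = 1.5f)
    {
        List<TextChunk> ordered = chunks
            .OrderByDescending(LineOf)      // PDF y grows upward, so top lines come first
            .ThenBy(c => c.StartX)
            .ToList();

        var result = new StringBuilder();
        TextChunk previous = null;
        foreach (TextChunk chunk in ordered)
        {
            if (previous != null)
            {
                if (LineOf(chunk) != LineOf(previous))
                    result.Append('\n');
                else if (chunk.StartX - previous.EndX > gapThreshold)
                    result.Append(' ');
            }
            result.Append(chunk.Text);
            previous = chunk;
        }
        return result.ToString();
    }
}
```

In a render listener, the chunks would be collected in RenderText (for example from renderInfo.GetText() and the start and end points of renderInfo.GetBaseline()) and joined only after the whole page has been processed. A suitable threshold depends on the font size and would need tuning for real documents.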

How can I parse through a table in a pdf file?

I have a custom table with name, firstname, place of birth and place of living in a PDF file which I want to parse through in C#. One of the simplest way of doing it would be:
using (PdfLoadedDocument document = new PdfLoadedDocument("foobar"))
{
    for (var i = 0; i < document.Pages.Count; i++)
    {
        Console.WriteLine($"============ PAGE NO. {i+1} ============");
        Console.WriteLine(document.Pages[i].ExtractText());
    }
}
But the problem is the output:
============ PAGE NO. 38 ============
John L.SmithSan Francisco5400 Baden
There's no way I can separate this with a regex, so I need a way to parse each column of each row in order to get the customers' values separately. How can I parse a table in a PDF file with Syncfusion?
You will need a method that returns the coordinates of each character found in the PDF. Then you have some math to do (basically computing the distance between characters) to determine whether a character is part of a word and where the word itself is located along the x-axis. It requires quite a lot of work and effort, and I didn't find such a method in the Syncfusion documentation.
I wrote a class which does what you want, but it is for Java projects:
PDFLayoutTextStripper (upon PDFBox)
The Syncfusion control extracts text from a PDF document based on the structure of the content in the document. With the current implementation, it therefore cannot recognize the rows and columns of a table in the PDF document.
It is also not possible to extract the text in the same order in which the PDF document is displayed, because the content of a PDF document follows a fixed layout.
However, the table can be exported to Excel using Tabula (an open-source library). I have modified Tabula Java to achieve layout-based text extraction from the PDF document, based on your requirement.
Please find the sample for this implementation in below link:
http://www.syncfusion.com/downloads/support/directtrac/171585/ze/TextExtractionSample649531336
Kindly ensure the following before executing the sample:
1. Install the Java Runtime Environment (JRE) from the link below:
http://www.oracle.com/technetwork/java/javase/downloads/
2. Restart your machine.
3. Execute the above sample.
Try this and check whether it meets your requirement.

How to get words from current visible page of ms word document?

I'm developing a C# add-in for MS Word. I can grab all the words of the current document with something like this:
app = (Word._Application)Application; // the Application object is passed in when the add-in connects
foreach (Word.Range word in app.ActiveDocument.Words)
{
    doSmth(word);
}
My question is: how do I grab the words not from the entire document but only from the currently active (visible to the user) page?
In other words, I need to define active page/paragraph of app.Application.ActiveDocument and do something with "active" words.
Interesting question. [See update at end]
Word's object model doesn't really have a "page" object, because the pagination of the document is constantly changing as you add and remove content (or change the font size, the paper size, etc.). So, there is no "ActiveDocument.Pages(1)" sort of thing.
What's more, there's no easy way to tell what page is currently displayed. In part, that's because the user doesn't necessarily see only one page at a time. He may be viewing the end of one page and the start of the next, or several pages may be displayed - depending on his view settings.
If I can make the question slightly easier, then perhaps I can answer it in a way that helps you. Let me re-define "current active (visible for user) page" as the page where the selection is. (Actually, since the selection can span several pages, let's define it as "the page where the active end of the selection is").
I'll also answer using VBA because it's easier to play around with it in the VBA immediate window, and it's trivial to convert to C# when you need to (it's the same object model, after all).
Word's Selection object has the properties of a Range, and if you simply wanted all the selected words, then this would be trivial (Selection.Words!). However, if we want all the words on that page, then we need to work a little harder.
First, let's find out what page the (active end of the) selection is on. For this, we can use the Information property:
pageNumber = Selection.Information(wdActiveEndPageNumber)
So now we know what page we're interested in. We now need to get a Range object that includes all the text on that page. We need to do this in two steps - by finding first the start and then the end of that range.
To find the start of the range, we can use the GoTo method, which returns a Range object representing the start of a specified item:
startOfRange = ActiveDocument.GoTo(WdGoToItem.wdGoToPage, WdGoToDirection.wdGoToAbsolute, pageNumber).Start
The end of the range is either the start of the next page (minus one character, but let's not quibble), or the end of the document (if we're on the last page):
If pageNumber = ActiveDocument.Content.Information(wdNumberOfPagesInDocument) Then
endOfRange = ActiveDocument.Content.End
Else
endOfRange = ActiveDocument.GoTo(WdGoToItem.wdGoToPage, WdGoToDirection.wdGoToAbsolute, pageNumber + 1).Start
End If
Now we can construct a Range object encompassing all the text on the page:
Set pageRange = ActiveDocument.Range(startOfRange, endOfRange)
... and from there we can get the words:
Set words = pageRange.Words
Here is a short VBA macro that uses the above technique to report the number of words on the active page:
Sub Test()
    Dim pageNumber As Long
    ' Range positions are Long values; Integer would overflow in longer documents
    Dim startOfRange As Long
    Dim endOfRange As Long
    Dim pageRange As Range
    pageNumber = Selection.Information(wdActiveEndPageNumber)
    startOfRange = ActiveDocument.GoTo(WdGoToItem.wdGoToPage, WdGoToDirection.wdGoToAbsolute, pageNumber).Start
    If pageNumber = ActiveDocument.Content.Information(wdNumberOfPagesInDocument) Then
        endOfRange = ActiveDocument.Content.End
    Else
        endOfRange = ActiveDocument.GoTo(WdGoToItem.wdGoToPage, WdGoToDirection.wdGoToAbsolute, pageNumber + 1).Start
    End If
    Set pageRange = ActiveDocument.Range(startOfRange, endOfRange)
    MsgBox pageRange.Words.Count
End Sub
UPDATE
OK, it turns out that there's a much easier way to do this. Word has a "special bookmark" that points to the text on the current page, so this will do the same as all that code above:
Set words = ActiveDocument.Bookmarks("\page").Range.Words
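Since the question is about a C# add-in, a possible translation of the "\page" bookmark approach is sketched below. The class name, the method name, and the Action callback are illustrative assumptions; the sketch also assumes a reference to the Word interop assembly with embedded interop types and an open document.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

static class ActivePageWords
{
    // Hypothetical helper: calls the given action for every word on the page
    // that contains the current selection.
    public static void ForEachWordOnActivePage(Word.Application app, Action<Word.Range> action)
    {
        // "\page" is the predefined bookmark from the VBA one-liner above;
        // the backslash has to be escaped in a C# string literal.
        Word.Range pageRange = app.ActiveDocument.Bookmarks["\\page"].Range;

        // Each item of the Words collection is itself a Range.
        foreach (Word.Range word in pageRange.Words)
        {
            action(word);
        }
    }
}
```

From the add-in it could be called as ActivePageWords.ForEachWordOnActivePage(app, doSmth), provided doSmth accepts a Word.Range.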

Prevent Word document's fields from updating when opened

I wrote a utility for another team that recursively goes through folders and converts the Word docs found to PDF by using Word Interop with C#.
The problem we're having is that the documents were created with date fields that update to today's date before they get saved out. I found a method to disable updating fields before printing, but I need to prevent the fields from updating on open.
Is that possible? I'd like to do the fix in C#, but if I have to do a Word macro, I can.
As described in Microsoft's endless maze of documentation you can lock the field code. For example in VBA if I have a single date field in the body in the form of
{DATE \# "M/d/yyyy h:mm:ss am/pm" \* MERGEFORMAT }
I can run
ActiveDocument.Fields(1).Locked = True
Then if I make a change to the document, save, then re-open, the field code will not update.
Example using c# Office Interop:
Word.Application wordApp = new Word.Application();
Word.Document wordDoc = wordApp.ActiveDocument;
wordDoc.Fields.Locked = 1; // it's apparently an Int32 rather than a bool
You can place the code in the DocumentOpen event. I'm assuming you have an add-in which subscribes to the event. If not, clarify, as that can be a battle on its own.
EDIT: In my testing, locking fields in this manner locks them across all StoryRanges, so there is no need to get the field instances in headers, footers, footnotes, textboxes, ..., etc. This is a surprising treat.
Well, I didn't find a way to do it with Interop, but my company did buy Aspose.Words and I wrote a utility to convert the Word docs to TIFF images. The Aspose tool won't update fields unless you explicitly tell it to. Here's a sample of the code I used with Aspose. Keep in mind, I had a requirement to convert the Word docs to single page TIFF images and I hard-coded many of the options because it was just a utility for myself on this project.
private static bool ConvertWordToTiff(string inputFilePath, string outputFilePath)
{
    try
    {
        Document doc = new Document(inputFilePath);
        for (int i = 0; i < doc.PageCount; i++)
        {
            ImageSaveOptions options = new ImageSaveOptions(SaveFormat.Tiff);
            options.PageIndex = i;
            options.PageCount = 1;
            options.TiffCompression = TiffCompression.Lzw;
            options.Resolution = 200;
            options.ImageColorMode = ImageColorMode.BlackAndWhite;
            var extension = Path.GetExtension(outputFilePath);
            var pageNum = String.Format("-{0:000}", (i+1));
            var outputPageFilePath = outputFilePath.Replace(extension, pageNum + extension);
            doc.Save(outputPageFilePath, options);
        }
        return true;
    }
    catch (Exception ex)
    {
        LogError(ex);
        return false;
    }
}
I think a new question on SO is appropriate then, because this will require XML processing rather than just Office Interop. If you have both .doc and .docx file types to convert, you might require two separate solutions: one for WordML (Word 2003 XML format), and another for OpenXML (Word 2007/2010/2013 XML format), since you cannot open the old file format and save as the new without the fields updating.
Inspecting the OOXML of a locked field shows us this w:fldLock="1" attribute. This can be inserted using appropriate XML processing against the document, such as through the OOXML SDK, or through a standard XSLT transform.
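As a rough illustration of the Open XML SDK route for .docx files, the sketch below sets w:fldLock on every field in the main document body. The class and method names are assumptions made for illustration, headers, footers, and other parts are deliberately not handled, and the code has not been verified against the original poster's documents.

```csharp
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class FieldLocker
{
    // Hypothetical helper: locks all fields in the main document part so that
    // they are not updated when the document is opened.
    public static void LockBodyFields(string docxPath)
    {
        using (WordprocessingDocument package = WordprocessingDocument.Open(docxPath, true))
        {
            Document document = package.MainDocumentPart.Document;

            // Complex fields: w:fldLock belongs on the w:fldChar element that begins the field.
            foreach (FieldChar fieldChar in document.Descendants<FieldChar>())
            {
                if (fieldChar.FieldCharType != null &&
                    fieldChar.FieldCharType.Value == FieldCharValues.Begin)
                {
                    fieldChar.FieldLock = true;
                }
            }

            // Simple fields: w:fldLock belongs on the w:fldSimple element itself.
            foreach (SimpleField simpleField in document.Descendants<SimpleField>())
            {
                simpleField.FieldLock = true;
            }

            document.Save();
        }
    }
}
```

Fields in headers and footers live in separate parts (HeaderParts and FooterParts of the main document part) and would need the same treatment.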
Might be helpful: the how-do-i-unlock-a-content-control-using-the-openxml-sdk-in-a-word-2010-document question describes a similar situation, but for Content Controls. You may be able to apply the same solution to fields if the Lock and LockingValues types apply to fields in the same way. I am not certain of this, however.
To give more confidence that this is the way to do it, see example of this vendor's solution for the problem. If you need to develop this in-house, then openxmldeveloper.org is a good place to start - look for Eric White's examples for manipulating fields such as this.

OpenXML: Anyway to see if a Word Document fits one page

While I doubt it: if I open a Word document using the OpenXML SDK in C# and add some info, is there any way for me to see whether it still fits on one page?
If it doesn't, I want to reduce the font size of specific items I added until it fits.
I could write this algorithm if I had the current size in relation to page size with margins and all that.
I ran across this example on another site, don't know if it'll work in your case, as it requires the Office PIA...
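var app = new Word.Application();
var doc = app.Documents.Open("path/to/file");
doc.Repaginate();
// BuiltInDocumentProperties is late-bound (dynamic when interop types are embedded),
// so the value is cast explicitly; "as int" does not compile for a value type.
int pageCount = (int)doc.BuiltInDocumentProperties["Number of Pages"].Value;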
