Counting page breaks in a Word doc using Word interop - C#

I've been searching online for a way to count the page breaks in a Word document, but to no avail, and Microsoft's documentation offers little help on this topic. I'd appreciate any help getting the number of page breaks using Word interop. I'm using WinForms.
Thanks

You can count the page breaks by searching for ^012, like so:
int totalPageBreaks = 0;
Microsoft.Office.Interop.Word.Range rng;
rng = doc.Range();
rng.Collapse(WdCollapseDirection.wdCollapseStart);
while (true) {
    rng.Find.ClearFormatting();
    rng.Find.Text = "^012";
    rng.Find.Forward = true;
    rng.Find.Wrap = WdFindWrap.wdFindStop;
    rng.Find.Format = false;
    rng.Find.MatchCase = false;
    rng.Find.MatchWholeWord = false;
    rng.Find.MatchWildcards = false;
    rng.Find.Execute();
    if (!rng.Find.Found)
        break;
    // increment counter
    totalPageBreaks++;
    // do some processing here if you'd like
    // reset the range so the next search starts after this match
    rng.Collapse(WdCollapseDirection.wdCollapseEnd);
}
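The same loop can be driven by the Boolean that Find.Execute returns. Below is a minimal sketch, assuming a project that references Microsoft.Office.Interop.Word with embedded interop types (so the optional COM parameters can be omitted). The names FindHelpers and CountMatches are made up for illustration. As far as I know, character 12 also marks section breaks, so the count may include those as well.

```csharp
using Word = Microsoft.Office.Interop.Word;

static class FindHelpers
{
    // Hypothetical helper: counts how often findText occurs in the document body.
    public static int CountMatches(Word.Document doc, string findText)
    {
        int count = 0;
        Word.Range rng = doc.Content;
        rng.Collapse(Word.WdCollapseDirection.wdCollapseStart);

        Word.Find find = rng.Find;
        find.ClearFormatting();
        find.Forward = true;
        find.Wrap = Word.WdFindWrap.wdFindStop; // stop at the end, do not wrap around
        find.MatchWildcards = false;

        // Execute returns true while a match is found; rng then covers that match.
        while (find.Execute(FindText: findText))
        {
            count++;
            // Continue searching after the current match.
            rng.Collapse(Word.WdCollapseDirection.wdCollapseEnd);
        }
        return count;
    }
}
```

With this in place, the answer's count would be something like FindHelpers.CountMatches(doc, "^012").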

Related

How to use advanced find in MS Word with .NET C#

My problem is this: I need to use advanced find to locate specific text (based on the font of the text, the text string, wildcards, ...) and write to a Notepad file which page the text was found on.
I see that .NET has the Find.Execute method, but I don't know whether it can do this. I have googled around without success.
My idea is something like this code:
using Microsoft.Office.Core;
using Word = Microsoft.Office.Interop.Word;
using System.Reflection;
...
Word.Application oWord;
Word._Document oDoc;
oWord = new Word.Application();
oWord.Visible = false;
oDoc = oWord.Documents.Open(strPath, ReadOnly: true);
Word.Range findRange;
Word.Range resultRange;
int nPage;
//Get the range of the whole word document
int nEnd;
nEnd = oDoc.Paragraphs.Last.Range.Sentences.First.End;
findRange = oDoc.Range(0, nEnd);
//Setup find condition
//The color of the found text: RGB(243,99, 195) . Please help!
//Execute find --> Loop until not found anymore
{
//findRange.Find.Execute... Please help!
//Get the range of the found text
//resultRange = ... Please help!
//Get page of the result range
nPage = resultRange.get_Information(Word.WdInformation.wdActiveEndPageNumber);
//Do anything you like with nPage
}
//Close the process
oDoc.Close(Word.WdSaveOptions.wdDoNotSaveChanges);
((Word._Application)oWord).Quit(Word.WdSaveOptions.wdDoNotSaveChanges);
Thank you in advance.
Thank God, I found my solution.
It came together after reading:
this article, to find out how to loop the find-next feature.
This article, to find out that you must use Find.Font.Color instead of Find.Font.TextColor.RGB.
This article, to get the page range (the code is pretty unclean, but usable).
OK, here it goes:
Word.Application oWord;
Word._Document oDoc;
oWord = new Word.Application();
oWord.Visible = false;
oDoc = oWord.Documents.Open(strWorkingPath, ReadOnly: true);
//===================Execute===================
/*Word 2013*/
oWord.ActiveWindow.View.ReadingLayout = false;
// Get pages count
Word.WdStatistic PagesCountStat = Word.WdStatistic.wdStatisticPages;
int nTotalPage = oDoc.ComputeStatistics(PagesCountStat);
int nEndOfTheDoc = oDoc.Paragraphs.Last.Range.Sentences.First.End;
int nStart = 0;
int nEnd = nEndOfTheDoc;
List<int> lstPage = new List<int>();
int color = 696969;//You can get the color value by reading the Font.Color of the Range in the debugger
Word.Range findRange;
object What = Microsoft.Office.Interop.Word.WdGoToItem.wdGoToPage;
object Which = Microsoft.Office.Interop.Word.WdGoToDirection.wdGoToAbsolute;
object nCrtPage;
object nNextPage;
bool bPageIsIn = false;
/*Loop the pages*/
for (int i = 1; i <= nTotalPage; i++)
{
/*Get the start and end position of the current page*/
nCrtPage = i;
nNextPage = i + 1;
nStart = oWord.Selection.GoTo(ref What,
ref Which, ref nCrtPage).Start;
nEnd = oWord.Selection.GoTo(ref What,
ref Which, ref nNextPage).End;
/*The last page: nStart will equal nEnd*/
if(nStart == nEnd)
{
/*Set nEnd for the last page*/
nEnd = nEndOfTheDoc;
}
/*Set default for Count page trigger*/
bPageIsIn = false;
/*Set the find range is the current page range*/
findRange = oDoc.Range(nStart, nEnd);
/*Set up find condition*/
findRange.Find.Font.Color = (Word.WdColor)color;
findRange.Find.Format = true;
findRange.Find.Text = "^?";
do
{
/*Loop find next*/
findRange.Find.Execute();
/*If found*/
if (findRange.Find.Found)
{
/*If found data is still in the page*/
if (findRange.End <= nEnd)
{
/*If found data is visible by human eyes*/
if (!string.IsNullOrWhiteSpace(findRange.Text))
{
/*Ok, count this page*/
bPageIsIn = true;
break;/*no need to find anymore for this page*/
}
}
}
else
break;/*no need to find anymore for this page*/
}while (findRange.End < nEnd);/*Make sure it is in that page only*/
if (bPageIsIn)
lstPage.Add(i);
}
//===================Close===================
oDoc.Close(Word.WdSaveOptions.wdDoNotSaveChanges);
((Word._Application)oWord).Quit(Word.WdSaveOptions.wdDoNotSaveChanges);
foreach (var item in lstPage)
{
builder.AppendLine(item.ToString());//Do anything you like with the list page
}
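The color value passed to Find.Font.Color does not have to be read from the debugger. Word color values use the same packing as VBA's RGB function (red in the low byte, then green, then blue), so the value can be computed from the RGB components. A small sketch; the class and method names are hypothetical:

```csharp
using System;

public static class WordColors
{
    // Hypothetical helper: packs RGB components the way VBA's RGB() does,
    // i.e. red + green * 256 + blue * 65536.
    public static int FromRgb(int red, int green, int blue)
    {
        if (red < 0 || red > 255) throw new ArgumentOutOfRangeException("red");
        if (green < 0 || green > 255) throw new ArgumentOutOfRangeException("green");
        if (blue < 0 || blue > 255) throw new ArgumentOutOfRangeException("blue");
        return red | (green << 8) | (blue << 16);
    }
}
```

For the color in the question, WordColors.FromRgb(243, 99, 195) gives 12805107, which could then be cast to Word.WdColor in the same way as the Find.Font.Color assignment in the answer.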

Splitting word document into separate pages using c#

An hour ago I was searching for code that splits a Word document into separate pages, and I found this question.
Using the code from that thread:
static class PagesExtension {
public static IEnumerable<Range> Pages(this Document doc) {
int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];
int pageStart = 0;
for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
var page = doc.Range(
pageStart
);
if (currentPageIndex < pageCount) {
//page.GoTo returns a new Range object, leaving the page object unaffected
page.End = page.GoTo(
What: WdGoToItem.wdGoToPage,
Which: WdGoToDirection.wdGoToAbsolute,
Count: currentPageIndex+1
).Start-1;
} else {
page.End = doc.Range().End;
}
pageStart = page.End + 1;
yield return page;
}
yield break;
}
}
I call the code above using this code
var app = new Microsoft.Office.Interop.Word.Application();
object missObj = System.Reflection.Missing.Value;
app.Visible = false;
var doc = app.Documents.Open(fileLocation);
int pageNumber = 1;
foreach (var page in doc.Pages())
{
Microsoft.Office.Interop.Word.Document newDoc = app.Documents.Add(ref missObj, ref missObj, ref missObj, ref missObj);
page.Copy();
var doc2 = app.Documents.Add();
doc2.Range().Paste();
object newDocName = pageNumber.ToString() + ".docx";
Console.WriteLine(newDocName);
doc2.SaveAs2(newDocName, Microsoft.Office.Interop.Word.WdSaveFormat.wdFormatXMLDocument,
CompatibilityMode: Microsoft.Office.Interop.Word.WdCompatibilityMode.wdWord2010);
pageNumber++;
}
app.ActiveDocument.Close();
app.Quit();
But I'm getting an error with one specific document. Here is the error:
This method or property is not available because no text is selected.
What is the reason for it? I checked the document and found that it contains a lot of empty space before the next page. How can I solve this?
Also, the code above doesn't copy the header and footer. Thank you.
Update: Error
This method or property is not available because no text is selected.
at Microsoft.Office.Interop.Word.Range.Copy()
at retrieveObjects(String location) in Document.cs:line 31
and this is the line
page.Copy();

c# itextsharp, locate words not chunks in page with their location for adding sticky notes

I have already read all the related Stack Overflow questions and haven't found a decent solution. I want to open a PDF, get the text (words) and their coordinates, and then add a sticky note to some of them.
It seems to be mission impossible, and I'm stuck.
How come this code correctly finds all the words in a page (but not their coordinates)?
using (PdfReader reader = new PdfReader(path))
{
StringBuilder sb = new StringBuilder();
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
for (int page = 5; page <= 5; page++)
{
string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
Console.WriteLine(text);
}
//txt = sb.ToString();
}
This one, on the other hand, gets coordinates, but only for "chunks", and I cannot rely on them being in the proper order.
PdfReader reader = new PdfReader(path);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
LocationTextExtractionStrategyEx strategy;
for (int i = 5; i <= 5; i++) // reader.NumberOfPages
{
//strategy = parser.ProcessContent(i, new SimpleTextExtractionStrategy());
// new MyLocationTextExtractionStrategy("sample", System.Globalization.CompareOptions.None)
strategy = parser.ProcessContent(i, new LocationTextExtractionStrategyEx("MCU_MOSI", 0));
foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk chunk in strategy.m_DocChunks)
{
if (chunk.m_text.Trim() == "MCU_MOSI")
Console.WriteLine("Bingo"); // <-- NEVER HIT
}
//Console.WriteLine(strategy.m_SearchResultsList.ToString()); // strategy.GetResultantText() +
}
This uses a class from the following post (slightly modified by me):
Getting Coordinates of string using ITextExtractionStrategy and LocationTextExtractionStrategy in Itextsharp
But it only finds useless "chunks".
So the question is: can iTextSharp really locate words in a page, so that I can add sticky notes nearby? Thank you.
It looks like chunk.m_text only contains one letter at a time, which is why this will never be true:
if (chunk.m_text.Trim() == "MCU_MOSI")
What you could do instead is have each chunk text added to a string and see if it contains your text.
PdfReader reader = new PdfReader(path);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
LocationTextExtractionStrategyEx strategy;
string str = string.Empty;
for (int i = 5; i <= 5; i++) // reader.NumberOfPages
{
strategy = parser.ProcessContent(i, new LocationTextExtractionStrategyEx("MCU_MOSI", 0));
var x = strategy.m_SearchResultsList;
foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk chunk in strategy.m_DocChunks)
{
str += chunk.m_text;
if (str.Contains("MCU_MOSI"))
{
str = string.Empty;
Vector location = chunk.m_endLocation;
Console.WriteLine("Bingo");
}
}
}
Note for the example of the location, I made m_endLocation public.
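The accumulate-and-search idea can also be written so that it reports both the first and the last chunk of every occurrence. The sketch below does not call iTextSharp at all: Chunk is a hypothetical stand-in for ExtendedTextChunk, and the X and Y fields are assumed placeholders for whatever location data the strategy exposes. It also assumes the chunks are already in reading order.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the extraction strategy's chunk type.
public sealed class Chunk
{
    public string Text;
    public float X;
    public float Y;
    public Chunk(string text, float x, float y) { Text = text; X = x; Y = y; }
}

public static class ChunkSearch
{
    // Returns (firstChunkIndex, lastChunkIndex) for each occurrence of word
    // in the concatenated chunk text.
    public static List<Tuple<int, int>> FindWord(IList<Chunk> chunks, string word)
    {
        var hits = new List<Tuple<int, int>>();
        if (string.IsNullOrEmpty(word)) return hits;

        var all = new StringBuilder();
        var owner = new List<int>(); // owner[k] = index of the chunk that supplied character k
        for (int i = 0; i < chunks.Count; i++)
        {
            foreach (char c in chunks[i].Text)
            {
                all.Append(c);
                owner.Add(i);
            }
        }

        string text = all.ToString();
        int pos = text.IndexOf(word, StringComparison.Ordinal);
        while (pos >= 0)
        {
            hits.Add(Tuple.Create(owner[pos], owner[pos + word.Length - 1]));
            pos = text.IndexOf(word, pos + word.Length, StringComparison.Ordinal);
        }
        return hits;
    }
}
```

The start location of chunks[hit.Item1] and the end location of chunks[hit.Item2] would then bracket the word, which is roughly what a nearby sticky note needs.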

can not correctly insert text into a bookmark from another bookmark

I'm writing a Windows Forms application which must exchange the content of Word bookmarks between two documents.
There are two similar documents (wordDocument and wordPattern) with a similar number of bookmarks. I'm trying this:
for (int i = 1; i <= wordDocument.Bookmarks.Count; i++)
{
object j = i;
wordDocument.Bookmarks.get_Item(ref j).Range.Text = wordPattern.Bookmarks.get_Item(ref j).Range.Text.ToString();
//MessageBox.Show(wordDocument.Bookmarks[i].Range.Text);
//MessageBox.Show(wordPattern.Bookmarks[i].Range.Text);
}
But it does the task incorrectly: it processes the bookmarks in the wrong order and deletes them. Please help me with the right way to exchange the text inside the bookmarks.
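// Assumed declaration; it was not shown in the original post.
List<Word.Range> listOfRanges = new List<Word.Range>();
int count1 = 0;
int count2 = 0;
// Collect the bookmark ranges of the first document
foreach (Word.Bookmark bookmark1 in wordDocument.Bookmarks)
{
    Word.Range bmRange = bookmark1.Range;
    //bmRange.Text = "note" + count1;
    listOfRanges.Add(bmRange);
    count1++;
}
// Write their text into the bookmarks of the second document, in the same order
foreach (Word.Bookmark bookmark2 in wordPattern.Bookmarks)
{
    Word.Range mbRange = bookmark2.Range;
    mbRange.Text = listOfRanges[count2].Text;
    count2++;
}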
Solved it that way.
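Two things may explain the symptoms in the question. Assigning Range.Text to a bookmark's range replaces the bookmarked content and normally removes the bookmark, and the Bookmarks collection is, as far as I know, sorted by name by default, so index i need not follow document order. A minimal sketch that matches bookmarks by name and re-adds each one after the assignment, assuming embedded interop types; the class and method names are hypothetical:

```csharp
using System.Collections.Generic;
using Word = Microsoft.Office.Interop.Word;

static class BookmarkCopy
{
    // Hypothetical helper: copies bookmark text from source to target by bookmark name.
    public static void CopyByName(Word.Document source, Word.Document target)
    {
        // Collect the names first so the collection is not modified while iterating.
        var names = new List<string>();
        foreach (Word.Bookmark bm in source.Bookmarks)
            names.Add(bm.Name);

        foreach (string name in names)
        {
            if (!target.Bookmarks.Exists(name))
                continue; // no counterpart in the target document

            string text = source.Bookmarks[name].Range.Text ?? string.Empty;
            Word.Range rng = target.Bookmarks[name].Range;
            rng.Text = text;                 // this usually deletes the bookmark
            target.Bookmarks.Add(name, rng); // so put it back around the new text
        }
    }
}
```

Called as BookmarkCopy.CopyByName(wordPattern, wordDocument), this would fill wordDocument's bookmarks from wordPattern, which is the direction used in the question's loop.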

How can I optimize this function? c# scan text file for strings

I'm writing a program to scan a text file for blocks of strings (lines) and output the blocks to a file when found.
In my process class, the function proc() is taking an unusually long time to process a 6MB file. A previous program I wrote, which scanned the text for only one specific type of string, took 5 seconds to process the same file. Now that I have rewritten it to scan for the presence of different strings, it takes over 8 minutes, which is a significant difference. Does anyone have any ideas on how to optimize this function?
This is my RegEx
System.Text.RegularExpressions.Regex RegExp { get { return new System.Text.RegularExpressions.Regex(@"(?s)(?-m)MSH.+?(?=[\r\n]([^A-Z0-9]|.{1,2}[^A-Z0-9])|$)", System.Text.RegularExpressions.RegexOptions.Compiled); } }
public static class TypeFactory
{
public static List<IMessageType> GetTypeList()
{
List<IMessageType> types = new List<IMessageType>();
types.AddRange(from assembly in AppDomain.CurrentDomain.GetAssemblies()
from t in assembly.GetTypes()
where t.IsClass && t.GetInterfaces().Contains(typeof(IMessageType))
select Activator.CreateInstance(t) as IMessageType);
return types;
}
}
public class process
{
public void proc()
{
IOHandler.Read reader = new IOHandler.Read(new string[1] { @"C:\TEMP\DeIdentified\DId_RSLTXMIT.LOG" });
List<IMessageType> types = MessageType.TypeFactory.GetTypeList();
//TEST1
IOHandler.Write.writeReport(System.DateTime.Now.ToString(), "TEST", "v3test.txt", true);
foreach (string file in reader.FileList)
{
using (FileStream readStream = new FileStream(file, FileMode.Open, FileAccess.Read))
{
int charVal = 0;
Int64 position = 0;
StringBuilder fileFragment = new StringBuilder();
string message = string.Empty;
string current = string.Empty;
string previous = string.Empty;
int currentLength = 0;
int previousLength = 0;
bool found = false;
do
{
//string line = reader.ReturnLine(readStream, out charVal, ref position);
string line = reader.ReturnLine(readStream, out charVal);
for (int i = 0; i < types.Count; i++)
{
if (Regex.IsMatch(line, types[i].BeginIndicator)) //found first line of a message type
{
found = true;
message += line;
do
{
previousLength = types[i].RegExp.Match(message).Length;
//keep adding lines until match length stops growing
//message += reader.ReturnLine(readStream, out charVal, ref position);
message += reader.ReturnLine(readStream, out charVal);
currentLength = types[i].RegExp.Match(message).Length;
if (currentLength == previousLength)
{
//stop - message complete
IOHandler.Write.writeReport(message, "TEST", "v3test.txt", true);
//reset
message = string.Empty;
currentLength = 0;
previousLength = 0;
break;
}
} while (charVal != -1);
break;
}
}
} while (charVal != -1);
//END OF FILE CONDITION
if (charVal == -1)
{
}
}
}
IOHandler.Write.writeReport(System.DateTime.Now.ToString(), "TEST", "v3test.txt", true);
}
}
EDIT: I ran the profiling wizard in VS2012 and found that most of the time was spent in the Regex.Match function.
Here are some thoughts:
RegEx matching is not the most efficient way to do a substring search, and you are performing the match check once per "type" of match. Have a look at efficient substring matching algorithms such as Boyer-Moore if you need to match literal substrings rather than patterns.
If you must use RegEx, consider using compiled expressions.
Use a BufferedStream to improve IO performance. Probably marginal for a 6MB file, but it only costs a line of code.
Use a profiler to be sure exactly where time is being spent.
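One detail in the question's code may matter here: the RegExp property constructs a new Regex with RegexOptions.Compiled on every access, and proc() reads it twice per appended line, so the pattern is likely being rebuilt over and over. A minimal sketch of building it once and reusing it; the class name MessagePatterns is hypothetical and the pattern is copied from the question:

```csharp
using System.Text.RegularExpressions;

public static class MessagePatterns
{
    // Constructed once when the class is first used, then shared by all callers.
    public static readonly Regex Msh = new Regex(
        @"(?s)(?-m)MSH.+?(?=[\r\n]([^A-Z0-9]|.{1,2}[^A-Z0-9])|$)",
        RegexOptions.Compiled);
}
```

A message type's RegExp property could then simply return MessagePatterns.Msh instead of creating a new instance each time.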
High level ideas:
Use Regex.Matches to find all matches at once instead of one by one; matching one by one is probably the main performance hit.
Pre-build the search pattern to include multiple messages at once. You can use regex alternation (|) for this.
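Both ideas can be sketched together as follows. This assumes the whole file fits in memory (reasonable for 6MB) and that every message type ends the same way as the MSH pattern in the question. "BHS" is a made-up second begin indicator used only for illustration, and MessageScanner is a hypothetical name:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MessageScanner
{
    // One pattern for all message types, joined with alternation and built once.
    private static readonly Regex AnyMessage = new Regex(
        @"(?s)(?:MSH|BHS).+?(?=[\r\n]([^A-Z0-9]|.{1,2}[^A-Z0-9])|$)",
        RegexOptions.Compiled);

    // Returns every message found in the text, in order of appearance.
    public static List<string> FindAll(string text)
    {
        var messages = new List<string>();
        foreach (Match m in AnyMessage.Matches(text))
            messages.Add(m.Value);
        return messages;
    }
}
```

It could be called as MessageScanner.FindAll(System.IO.File.ReadAllText(path)), after which each message is written to the report in a single pass instead of re-running Match after every appended line.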
