Finding words in an office word document

I'm working on a program that classifies files into groups based on certain text found in them. Most of the files will probably be .doc or .docx.
My program should be able to compare a list of words with the words in the files.
I'm new to C# and I only study programming on my own, and the whole "read .doc file" thing goes way over my head, so any help would be greatly appreciated!
So far, the part of my code that deals with Office is:
CODE
if (Path.GetExtension(listBox1.SelectedItem.ToString()) == ".doc" ||
Path.GetExtension(listBox1.SelectedItem.ToString()) == ".docx")
{
Microsoft.Office.Interop.Word.Document doc =
new Microsoft.Office.Interop.Word.Document(listBox1.SelectedItem.ToString());
doc.Activate();
}
EDIT:
Sorry if the question wasn't clear enough.
My question is:
How can I find out whether the document contains any of the specific words listed in a text file?
I have read many other questions, answers and tutorials, and it might be just me, but I totally don't get it.

Here is an introduction on reading text out of a .docx file: http://www.codeproject.com/Articles/20529/Using-DocxToText-to-Extract-Text-from-DOCX-Files
You could convert the .doc files to .docx files and use the same process for both.

You seem to be using Microsoft's interop classes, so you can use Microsoft.Office.Interop.Word.Find.
MSDN description and HOW TO
The Execute method returns true if the searched range contains the text.
// "Word" is assumed to be an alias for Microsoft.Office.Interop.Word.
// "rodape" is a variable from the answerer's own code; to search a whole document, use its Content range instead.
Word.Range rng = rodape.Range;
Word.Find find = rng.Find;
find.ClearFormatting();
find.Replacement.ClearFormatting(); // Only required if you will replace the text
if (find.Execute("textToBeFound", false)) // second argument: MatchCase
{
    // The range contains the text
}
Another example, from microsoft:
private void SelectionFind()
{
    object findText = "find me";
    Application.Selection.Find.ClearFormatting();
    if (Application.Selection.Find.Execute(ref findText,
        ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
        ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
        ref missing, ref missing))
    {
        MessageBox.Show("Text found.");
    }
    else
    {
        MessageBox.Show("The text could not be located.");
    }
}
There are many other ways to do this, though.
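Since the question asks about checking a document against a whole list of words, the comparison step can be sketched separately from Word itself. The following is a minimal sketch, assuming the document's plain text has already been extracted (for example from the document's Content range) and that the word list has been loaded into a collection (for example with File.ReadAllLines). The class and method names WordListMatcher and FindMatches are hypothetical, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordListMatcher
{
    // Hypothetical helper: returns the entries of wordList that occur as whole
    // words in documentText, ignoring case. documentText is assumed to be plain
    // text that was already extracted from the document.
    public static List<string> FindMatches(string documentText, IEnumerable<string> wordList)
    {
        // Split on runs of non-word characters and keep the distinct words.
        var wordsInDocument = new HashSet<string>(
            Regex.Split(documentText, @"\W+").Where(w => w.Length > 0),
            StringComparer.OrdinalIgnoreCase);

        return wordList
            .Select(w => w.Trim())
            .Where(w => w.Length > 0 && wordsInDocument.Contains(w))
            .ToList();
    }

    public static void Main()
    {
        string text = "Invoice for the annual maintenance contract.";
        string[] keywords = { "invoice", "receipt", "Contract" };

        // Prints: invoice, Contract
        Console.WriteLine(string.Join(", ", FindMatches(text, keywords)));
    }
}
```

This only matches single words; multi-word phrases in the list would need a different check, such as a case-insensitive substring search on the whole text.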

Related

How to merge two Word Document Objects in C# (not on file system level)?

I want to append one Word Document object to another in my C# code.
Is there any way to do this directly with Word Interop functions when I have both documents as objects?
I tried several approaches with the Merge function, but it expects filenames as strings, not objects. I don't want to save both documents and load them again.
//schematic version of my code:
doc_template = wordapp.Documents.Open(ref filedocA,
ref readOnly, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,
ref missing);
doc_final = wordapp.Documents.Add(ref missing, ref missing, ref missing, ref missing);
wordapp.Document tempdoc = null;
foreach (customer cust in dt_customers)
{
tempdoc = doc_template;
//Do some search and replace functions in tempdoc
...
//here somehow the complete tempdoc should be appended to doc_final
}
doc_final should contain all the modified tempdoc from each customer
//Update: Here's what I am doing now: copying the contents within ranges
object start = tempdoc.Content.Start;
object end = tempdoc.Content.End;
tempdoc.Range(ref start, ref end).Copy();
doc_final.Activate();
start = Destination.Content.End;
end = Destination.Content.End;
start = wordapp.ActiveDocument.Content.End - 1;
rng = Destination.Range(ref start, ref missing);
rng.Select();
rng.Paste();
But is there a better option than simply copying the contents of the documents?

Merging multiple docx files to one

I am developing a desktop application in C#. I have coded a function to merge multiple docx files, but it does not work as expected. The content does not come out exactly as it was in the source files.
A few blank lines are added in between. The content extends onto the next pages, header and footer information is lost, page margins get changed, etc.
How can I concatenate the documents as they are, without any change? Any suggestions will be helpful.
This is my code.
public bool CombineDocx(string[] filesToMerge, string destFilepath)
{
Application wordApp = null;
Document wordDoc = null;
object outputFile = destFilepath;
object missing = Type.Missing;
object pageBreak = WdBreakType.wdPageBreak;
try
{
wordApp = new Application { DisplayAlerts = WdAlertLevel.wdAlertsNone, Visible = false };
wordDoc = wordApp.Documents.Add(ref missing, ref missing, ref missing, ref missing);
Selection selection = wordApp.Selection;
foreach (string file in filesToMerge)
{
selection.InsertFile(file, ref missing, ref missing, ref missing, ref missing);
selection.InsertBreak(ref pageBreak);
}
wordDoc.SaveAs( ref outputFile, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing);
return true;
}
catch (Exception ex)
{
Msg.Log(ex);
return false;
}
finally
{
if (wordDoc != null)
{
wordDoc.Close();
}
if (wordApp != null)
{
wordApp.DisplayAlerts = WdAlertLevel.wdAlertsAll;
wordApp.Quit();
Marshal.FinalReleaseComObject(wordApp);
}
}
}

In my opinion it's not so easy, so here are some tips.
I think you need to make the following changes to your code.
1. Instead of pageBreak, use whichever section break type is most appropriate:
object sectionBreak = WdBreakType.wdSectionBreakNextPage;
// other section break types are also available
and use this new variable within your loop.
As a result, the text, footers and headers of each source document are all carried over to the new one.
2. However, you will still need to read the margin settings and apply them to your new document 'manually' using additional code. To do that, open the source document and read its margins like this:
float leftMargin = SourceDocument.Sections[1].PageSetup.LeftMargin;
// ...and similarly for the other margins
and then apply them to the appropriate section of the new document:
selection.Sections[1].PageSetup.LeftMargin = leftMargin;
3. Some other document sections may require other techniques.

You could use the Open XML SDK and the DocumentBuilder tool.
See Merge multiple word documents into one Open Xml

Word Interop 2007 silent printing issue

I am using:
Office 2007
VC# Express 2010
1x Citrix virtual XP network environment accessed through Windows 7 laptop host
1x printer set to output to .prn in a given network-mapped drive
I am using C# and Word Interop to silently print a given set of files automatically. The application scans an input folder every 10 minutes for .doc / .docx files only and adds their path and filename to a list. For each file found, it attempts to print via the following code:
public static Boolean PrintFoundFiles(List<string> foundFiles)
{
successful = false;
foreach (string file in foundFiles)
{
object fileAndPath = file; //declare my objects here, since methods want an object passed
object boolTrue = true; //
object boolFalse = false; //
object output = FormatOutputName(file); //
object missing = System.Type.Missing; //
object copies = "1"; //
object pages = ""; //
object items = Word.WdPrintOutItem.wdPrintDocumentContent; //
object range = Word.WdPrintOutRange.wdPrintAllDocument; //
object pageType = Word.WdPrintOutPages.wdPrintAllPages; //
Word.Application wordApp = new Word.Application(); //open word application
wordApp.Visible = false; //invisible
Word.Document doc = wordApp.Documents.Open(ref fileAndPath, ref missing, ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing); //opens the word document into application behind the scenes
doc.Activate(); //activates document, else when I print the system will throw an exception
wordApp.ActivePrinter = "my printer name"; //Specify printer I will print from
doc.PrintOut(ref boolTrue, ref boolFalse, ref range, ref output, ref missing, ref missing,
ref items, ref copies, ref pages, ref pageType, ref boolTrue, ref boolTrue,
ref missing, ref boolFalse, ref missing, ref missing, ref missing, ref missing);
doc.Close(SaveChanges: false);
doc = null;
((Word._Application)wordApp).Quit(SaveChanges: false); //kill word process the right way
wordApp = null; //reset to null
successful = true;
}
return successful;
}
After I receive the true boolean of "successful", I will back up the file automatically in a backup folder, delete it in the input folder, and look for the .prn in the output folder (it just sits here for processing later).
If I don't provide an output (see ref output on doc.PrintOut()), the output directory doesn't get updated or printed to at all. If I DO provide an output, the .prn is created, though it is a 0kb empty file.
The printer is set up as the default printer, and it has been configured to automatically output to said output folder. If I open Word manually with the same file I'm trying to automatically print from, hit print, it will create a 6kb .prn file in the output directory without having to hit anything other than CTRL + P, OK.
I'm fairly confident the file is being opened OK via "Word.Document doc = wordApp.Documents.Open()" because I did a doc.FullName and got the full path of the input word document in question. I just cannot for the life of me get the .prn to output correctly to the output folder.

If I start Word (2010) and record a macro of me pressing Ctrl+P and hitting Print, I get:
Application.PrintOut fileName:="", Range:=wdPrintAllDocument, Item:= _
wdPrintDocumentWithMarkup, Copies:=1, Pages:="", PageType:= _
wdPrintAllPages, Collate:=True, Background:=True, PrintToFile:=False, _
PrintZoomColumn:=0, PrintZoomRow:=0, PrintZoomPaperWidth:=0, _
PrintZoomPaperHeight:=0
Change your PrintOut call to reflect what Word did and see if that solves your issue.
There's no reason to settle for being "fairly confident"; just remove
wordApp.Visible = false
Then debug your program and make certain the document opens correctly.

C# Word Document - How to clean formatting?

The dilemma is rather simple. I need to create a small app that will clear all font background colors (leave table cell background colours unchanged), and remove all text with strikethrough in a word document, and then save the document into another folder. Otherwise the document's formatting should remain untouched.
Below is a large-ish example scraped together from random examples available on Google, showing how to apply specific kinds of formatting to strings found using Find.Execute(). However, I have no clue how to do only what is described above.
public static string searchDoc(string fileNameRef)
{
Microsoft.Office.Interop.Word._Application word = new Microsoft.Office.Interop.Word.Application(); ;
Microsoft.Office.Interop.Word._Document doc = new Microsoft.Office.Interop.Word.Document();
object missing = System.Type.Missing;
try
{
System.IO.FileInfo ExecutableFileInfo =
new System.IO.FileInfo(System.Reflection.Assembly.GetEntryAssembly().Location);
object fileName =
System.IO.Path.Combine(ExecutableFileInfo.DirectoryName, fileNameRef);
doc = word.Documents.Open(ref fileName, ref missing, ref missing, ref missing
, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing
, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
doc.Activate();
//object findStr = "hello"; //sonething to find
// THIS is the part where I fail, I can't find of a way to Find.Execute on formatting
// as opposed to mere strings.
//while (word.Selection.Find.Execute(ref findStr)) //found...
//{
// //change font and format of matched words
// word.Selection.Font.Name = "Tahoma"; //change font to Tahoma
// word.Selection.Font.ColorIndex = Microsoft.Office.Interop.Word.WdColorIndex.wdRed; //change color to red
//}
object saveFileName = ExecutableFileInfo.DirectoryName + "\\New\\" + fileNameRef;
doc.SaveAs(ref saveFileName, ref missing, ref missing, ref missing, ref missing
, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing
, ref missing, ref missing, ref missing, ref missing, ref missing);
}
catch (Exception)
{
}
finally
{
doc.Close(ref missing, ref missing, ref missing);
word.Application.Quit(ref missing, ref missing, ref missing);
}
return fileNameRef;
}
Thanks for any help! And I do mean any, simply getting started on how to spot formatting would help a great deal, I imagine. :)

This is not a C#-specific question; it's a Word Object Model question (I refer you to here and here).
As to your specific question, I suggest you turn on the Macro Recorder in Word, perform the actions, and see the generated VBA code. Then you can apply it in C#.
Try this:
using System;
using Microsoft.Office.Interop.Word;
using System.IO;
using System.Reflection;

namespace WordFormattingFindReplace {
    class Program {
        static void Main(string[] args) {
        }

        public static string searchDoc(string fileName) {
            _Application word = new Application();
            _Document doc;
            string folderName = Path.GetDirectoryName(Assembly.GetEntryAssembly().Location);
            string filePath = Path.Combine(folderName, fileName);
            doc = word.Documents.Open(filePath);
            var find = doc.Range().Find;
            find.Text = "Hello";
            find.Format = true;
            find.Replacement.Font.Name = "Tahoma";
            find.Replacement.Font.ColorIndex = WdColorIndex.wdRed;
            find.Execute(Replace: WdReplace.wdReplaceAll);
            doc.SaveAs2(Path.Combine(folderName, "New", fileName));
            doc.Close();
            //We need to cast this to _Application to resolve which Quit method is being called
            ((_Application)word.Application).Quit();
            return fileName;
        }
    }
}
Some notes:
Use using directives for clarity. Instead of writing Microsoft.Office.Interop.Word._Application word, add using Microsoft.Office.Interop.Word; at the top of your file, and you can then just write _Application word.
If all you need is the folder name, use the static Path.GetDirectoryName method and store the result in a string variable, instead of creating a FileInfo object.
As of .NET 4, you can skip optional arguments when calling Documents.Open, Document.SaveAs and Document.Close. This also means you don't need an object missing.
There's nothing here the user really needs to see, so calling Document.Activate is unnecessary.
It's probably better to reuse the Word.Application instance instead of recreating it for each call.

automation of Doc to PDF in c#

I have about 200 Word documents that I need to convert to PDF.
Obviously, I cannot convert them one by one: first, it would take ages, and second, I am sure it is not good practice.
I need to find a way to automate the conversion, since we will need to do this again and again.
I use C#. The solution does not necessarily have to be in C#, but that is preferred.
I have had a look at a few libraries such as PDFCreator, the Office 2007 add-in, iTextSharp, and so forth, and there is no clear answer on the forums.
PDFCreator has a C# sample, but it only works with .txt files.
The Office 2007 add-in does not have document locking capabilities, which are a must for the automation.
Has anyone implemented such a scenario before? I would like to hear your suggestions.
Thanks in advance
regards

You can try the method in this blog post:
http://angrez.blogspot.com/2007/06/create-pdf-in-net-using-pdfcreator.html

I'm doing this to automate the conversion of our doc and docx documents to pdf:
private bool ConvertDocument(string file)
{
object missing = System.Reflection.Missing.Value;
OW.Application word = null;
OW.Document doc = null;
try
{
word = new OW.Application();
word.Visible = false;
word.ScreenUpdating = false;
Object filename = (Object)file;
doc = word.Documents.Open(ref filename, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing);
doc.Activate();
// Swap only the final extension; string Replace() is case-sensitive and would
// also rewrite ".doc" wherever else it appears in the path.
file = Path.ChangeExtension(file, ".pdf");
doc.ExportAsFixedFormat(file, OW.WdExportFormat.wdExportFormatPDF, false, OW.WdExportOptimizeFor.wdExportOptimizeForPrint,
OW.WdExportRange.wdExportAllDocument, 1, 1, OW.WdExportItem.wdExportDocumentContent, true, true, OW.WdExportCreateBookmarks.wdExportCreateNoBookmarks,
true, true, false, ref missing);
}
catch(Exception ex)
{
return false;
}
finally
{
if (doc != null)
{
object saveChanges = OW.WdSaveOptions.wdDoNotSaveChanges;
((OW._Document)doc).Close(ref saveChanges, ref missing, ref missing);
doc = null;
}
if (word != null)
{
((OW._Application)word).Quit(ref missing, ref missing, ref missing);
word = null;
}
}
return true;
}
where OW is an alias for Microsoft.Office.Interop.Word.
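When converting a whole folder of documents, it helps to decide up front which files qualify and what each output file will be called. The following is a minimal sketch of that planning step only; it does not touch Word. The class and method names PdfBatchPlanner, IsWordDocument and GetPdfPath are hypothetical, and the file names are made-up examples.

```csharp
using System;
using System.IO;
using System.Linq;

public static class PdfBatchPlanner
{
    // Hypothetical helper: true for .doc and .docx files, regardless of case.
    public static bool IsWordDocument(string path)
    {
        string ext = Path.GetExtension(path);
        return ext.Equals(".doc", StringComparison.OrdinalIgnoreCase)
            || ext.Equals(".docx", StringComparison.OrdinalIgnoreCase);
    }

    // Hypothetical helper: replaces only the final extension with ".pdf".
    public static string GetPdfPath(string sourcePath)
    {
        return Path.ChangeExtension(sourcePath, ".pdf");
    }

    public static void Main()
    {
        string[] candidates = { "report.docx", "LETTER.DOC", "notes.txt", "my.doc.v2.doc" };

        // Prints one "source -> target" line for each Word document.
        foreach (string file in candidates.Where(f => IsWordDocument(f)))
        {
            Console.WriteLine(file + " -> " + GetPdfPath(file));
        }
    }
}
```

In a real run the candidate list would come from the input folder, for example via Directory.EnumerateFiles, and each qualifying path would be passed to a conversion routine such as ConvertDocument.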

Have you checked this MSDN article?
Edit:
Note that this "How To" sample will not work as-is because:
For some reason it overwrites the program parameters (ConvertDocCS.exe [sourceDoc] [targetDoc] [targetFormat]) in lines #77, #81 and #82.
I converted the project to VS 2010 and had to re-reference Microsoft.Office.Core. It's a COM reference called Microsoft Office 12.0 Object Library.
The sample does not accept a relative path.
I'm sure you will manage to overcome those obstacles :)
One last thing: if you are working with .NET 4, you don't need to pass all those annoying Missing.Value arguments, thanks to the wonder of optional parameters.

You may try Aspose.Words for .NET to convert DOC files to PDF. It can be used in any .NET application with C# or VB.NET, like any other .NET assembly. It also works on any Windows OS and on 32/64-bit systems.
Disclosure: I work as developer evangelist at Aspose.

As HuBeZa said, if Word is installed on your workstation, you can use Word Automation to open your files one by one and save them as PDF.
All you need to do is reference the COM component "Microsoft Word Object Library" and work with the classes in that assembly.
The execution time will probably be a bit long, but your conversions will be automated.

You can set fonts through automation. I applied a single font to all documents generated by my solution, which saved me from manually going into each template and setting the font separately for each tag, heading, and so on.
using (WordprocessingDocument wordProcessingDocument = WordprocessingDocument.Open(input, true))
{
// Get all top-level elements of the document body
List<DocumentFormat.OpenXml.OpenXmlElement> elements =
wordProcessingDocument.MainDocumentPart.Document.Body.ToList();
// Set the font on every run-properties element found under each body element
foreach (var itm in elements)
{
try
{
List<RunProperties> list_runProperties =
itm.Descendants<RunProperties>().ToList();
foreach (var item in list_runProperties)
{
if (item.RunFonts == null)
item.RunFonts = new RunFonts();
item.RunFonts.Ascii = "Courier New";
item.RunFonts.ComplexScript = "Courier New";
item.RunFonts.HighAnsi = "Courier New";
item.RunFonts.Hint = FontTypeHintValues.ComplexScript;
}
}
catch (Exception)
{
//continue for other tags in document
//throw;
}
}
wordProcessingDocument.MainDocumentPart.Document.Save();
}

I think the straight answer to this is no.
It is possible through a workaround, though: I suggest using ImageMagick or a similar library to see whether it can render images of your Word document, and then using those images in iTextSharp to create the PDF.
