C# Word Interop - Spell Checking in a Certain Language

For a customer of mine, I need to force the spell check to run in a specific language.
I have explored the MSDN documentation and found that calling the CheckSpelling() method on the active document invokes the spell check. This method has parameters for custom dictionaries.
My problem is that I can't find anything about those dictionaries or how to use them.
There may, of course, be another way to do this.
Can anybody point me in the right direction?

Found my solution:
foreach (Range range in activeDocument.Words)
{
range.LanguageID = WdLanguageID.wdFrenchLuxembourg;
}
Edit after comment:
Since my active document is held in a variable, I seem to lose the static Range property. I found a workaround by doing the following (lan is the variable holding my WdLanguageID):
object start = activeDocument.Content.Start;
object end = activeDocument.Content.End;
activeDocument.Range(ref start, ref end).LanguageID = lan;
Thanks @Adrianno for all the help!

The spell checker uses the language of the text to select rules and dictionaries (look here to check how it works).
You have to set the language of the text to the one you need, and the spell checker will then use that language. Follow this link for more details:
http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.language.aspx

I have been working with this lately and thought I would add a bit to the already given answers.
To get a list of spelling errors in the document for a certain language, doing the following would get you going:
// Set the proofing language
myDocument.Content.LanguageID = WdLanguageID.wdDanish;
// Get the spelling errors (returns a ProofreadingErrors collection)
var errors = myDocument.SpellingErrors;
// There is no "ProofreadingError" object -> errors are accessed as Ranges
foreach (Range proofreadingError in errors)
Console.WriteLine(proofreadingError.Text);
As pointed out by Adriano, the key is to set the language of the document content first; you can then access the spelling errors for that language. I have tested this (Word Interop API version 15, Office 2013), and it works.
If you want to get suggestions for each of the misspelled words as well, I suggest you take a look at my previous answer to that issue: https://stackoverflow.com/a/14202099/700926
In that answer I provide sample code as well as links to relevant documentation for how that is done. In particular, the sample covers how to carry out spell checking of a given word in a certain language (of your choice) using Word Interop. The sample also covers how to access the suggestions returned by Word.
Finally, I have a couple of notes:
In contrast to the currently accepted answer (your own), this approach is much faster since it does not have to iterate through each word. I have been working with Word Interop for reports (100+ pages) and trust me, you don't want to sit and wait for that iteration to finish.
Information regarding the SpellingErrors property can be found here.
Information regarding the non-existence of a ProofreadingError object can be found here.

Never use foreach statements when accessing Office objects. Most Office objects are COM objects, and using foreach leads to memory leaks.
The following is a piece of working code:
Microsoft.Office.Interop.Word.ProofreadingErrors errorCollection = null;
try
{
errorCollection = Globals.ThisAddIn.Application.ActiveDocument.SpellingErrors;
// Indexes start at 1 in Office objects
for (int i = 1; i <= errorCollection.Count; i++)
{
int start = errorCollection[i].Start;
int end = errorCollection[i].End;
}
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
finally
{
// Release the COM objects here
// as finally shall be always called
if (errorCollection != null)
{
Marshal.ReleaseComObject(errorCollection);
errorCollection = null;
}
}

Related

I'm trying to skip tokens in gplex (a flex/lex port) and using yylex() is causing a stack overflow. Is there a better way to skip?

So right now in my lexer I'm trying to skip certain tokens like comments and whitespace, except I need to add them to my "skipped list", rather than hiding them altogether.
In my Scanner frame I have
public int Skip(int sym) {
Token t = _InitToken();
t.SymbolId=sym;
t.Line = Current.Line;
t.Column =Current.Column;
t.Position=Current.Position;
t.Value = yytext;
t.Skipped = null;
_skipped.Add(t);
return yylex();
}
Keep in mind this is C#, but the interface isn't much different from the C one in lex/flex.
I then use the function above in my scanner like so:
"/*" { if(!_TryReadUntilBlockEnd("*/")) return -1; return Skip(478); }
\/\/[^\n]* { return Skip(477); }
where 477 is my symbol id (the lex file is generated, hence the lack of constants).
All _TryReadUntilBlockEnd("*/") does is read until it finds a trailing */, consuming it. It's a well-tested method and can be ignored for the purposes of this question, except as an explanation of how I match the end of a comment. It takes over the underlying input from gplex and advances the underlying input stream itself (like fget() or whatever it is in C; I forget). Basically it's neutral here other than reading the entire comment. Skip(478) is the relevant bit, not this.
It works fine in many cases. The only problem is that I'm using it in a recursive descent parser that's parsing C#, so the stack gets heavy, and when I have a huge stream of line comments the stack overflows.
I could solve it by finding some way to run a match without invoking a lex action again, instead of calling yylex(), if that's possible. That way I could rewrite it to be iterative, but I have no idea how, and what I've seen of the generated code suggests it's not possible.
The other way I could solve it, and this is my preferred way, is to match multiple C# line comments in one match. That way I only recurse once.
But that needs a multiline match expression, which I think is disabled by default?
How do I enable multiline matching in flex, lex or gplex? Or is there another solution to the above problem? gplex 1.2.2 is preferred, but it's completely undocumented.
I'll take anything at this point. Thanks in advance!

I shouldn't have been calling yylex() at all. Thanks Jonathan and rici in the comments.
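In lex-family scanners, an action that finishes without a return lets the scanner carry on to the next match inside its own loop, so a skipped token does not need a recursive call. The following is a minimal, self-contained sketch of that control flow. It is not gplex-generated code: the MiniScanner class, its regular expressions, and the symbol ids other than 477 are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for a generated scanner; not the gplex API.
public sealed class MiniScanner
{
    public const int Eof = 0;            // assumed end-of-input id
    public const int Word = 1;           // assumed id for any other token
    public const int Whitespace = 476;   // assumed id
    public const int LineComment = 477;  // id taken from the question

    // \G anchors each match at the current scan position.
    private static readonly Regex CommentRx = new Regex(@"\G//[^\n]*");
    private static readonly Regex SpaceRx = new Regex(@"\G\s+");
    private static readonly Regex WordRx = new Regex(@"\G\S+");

    private readonly string _input;
    private int _pos;

    public MiniScanner(string input) { _input = input; }

    public List<(int SymbolId, string Value)> Skipped { get; } =
        new List<(int SymbolId, string Value)>();

    public string Text { get; private set; } = "";

    // Plays the role of yylex(): a single loop, no recursion.
    public int NextToken()
    {
        while (_pos < _input.Length)
        {
            Match m;
            if ((m = CommentRx.Match(_input, _pos)).Success)
            {
                Skip(LineComment, m);   // no return: the loop keeps scanning
            }
            else if ((m = SpaceRx.Match(_input, _pos)).Success)
            {
                Skip(Whitespace, m);
            }
            else
            {
                m = WordRx.Match(_input, _pos);
                _pos += m.Length;
                Text = m.Value;
                return Word;            // only real tokens return to the parser
            }
        }
        return Eof;
    }

    // Records the skipped token and advances; it never calls NextToken().
    private void Skip(int symbolId, Match m)
    {
        Skipped.Add((symbolId, m.Value));
        _pos += m.Length;
    }
}
```

With this shape, a long run of line comments only grows the Skipped list; the call stack stays flat no matter how many comments precede the next real token.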

Is it advisable to use tokens for the purpose of syntax highlighting?

I'm trying to implement syntax highlighting in C# on Android, using Xamarin. I'm using the ANTLR v4 library for C# to achieve this. My code, which is currently syntax highlighting Java with this grammar, does not attempt to build a parse tree and use the visitor pattern. Instead, I simply convert the input into a list of tokens:
private static IList<IToken> Tokenize(string text)
{
var inputStream = new AntlrInputStream(text);
var lexer = new JavaLexer(inputStream);
var tokenStream = new CommonTokenStream(lexer);
tokenStream.Fill();
return tokenStream.GetTokens();
}
Then I loop through all of the tokens in the highlighter and assign a color to them based on their kind.
public void HighlightAll(IList<IToken> tokens)
{
int tokenCount = tokens.Count;
for (int i = 0; i < tokenCount; i++)
{
var token = tokens[i];
var kind = GetSyntaxKind(token);
HighlightNext(token, kind);
if (kind == SyntaxKind.Annotation)
{
var nextToken = tokens[++i];
Debug.Assert(token.Text == "@" && nextToken.Type == Identifier);
HighlightNext(nextToken, SyntaxKind.Annotation);
}
}
}
public void HighlightNext(IToken token, SyntaxKind tokenKind)
{
int count = token.Text.Length;
if (token.Type != -1)
{
_text.SetSpan(_styler.GetSpan(tokenKind), _index, _index + count, SpanTypes.InclusiveExclusive);
_index += count;
}
}
Initially, I figured this was wise because syntax highlighting is largely context-independent. However, I have already found myself needing to special-case identifiers that follow @, since I want those to get highlighted as annotations just as on GitHub (example). GitHub has further examples of coloring identifiers in certain contexts: here, List and ArrayList are colored, while mItems is not. I will likely have to add further code to highlight identifiers in those scenarios.
My question is, is it a good idea to examine tokens rather than a parse tree here? On one hand, I'm worried that I might have to end up doing a lot of special-casing for when a token's neighbors alter how it should be highlighted. On the other, parsing will add additional overhead for memory-constrained mobile devices, and make it more complicated to implement efficient syntax highlighting (e.g. not re-tokenizing/parsing everything) when the user edits text in the code editor. I also found it significantly less complicated to handle all of the token types rather than the parser rule types, because you just switch on token.Type rather than overriding a bunch of Visit* methods.
For reference, the full code of the syntax highlighter is available here.

It depends on what you are syntax highlighting.
If you use a naive parser, then any syntax error in the text will cause highlighting to fail. That makes it quite a fragile solution since a lot of the texts you might want to syntax highlight are not guaranteed to be correct (particularly user input, which at best will not be correct until it is fully typed). Since syntax highlighting can help make syntax errors visible and is often used for that purpose, failing completely on syntax errors is counter-productive.
Text with errors does not readily fit into a syntax tree. But it does have more structure than a stream of tokens. Probably the most accurate representation would be a forest of subtree fragments, but that is an even more awkward data structure to work with than a tree.
Whatever the solution you choose, you will end up negotiating between conflicting goals: complexity vs. accuracy vs. speed vs. usability. A parser may be part of the solution, but so may ad hoc pattern matching.

Your approach is totally fine and pretty much what everybody's using. And it's totally normal to fine tune type matching by looking around (and it's cheap since the token types are cached). So you can always just look back or ahead in the token stream if you need to adjust actually used SyntaxKind. Don't start parsing your input. It won't help you.
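Looking back or ahead in the token list can be done without touching the loop index, by deciding each token's kind from its neighbors. This is a minimal sketch under stated assumptions: the Tok class and Kind enum are hypothetical stand-ins for ANTLR's IToken and the question's SyntaxKind, and only the annotation marker @ is handled.

```csharp
using System;
using System.Collections.Generic;

public enum Kind { Plain, Identifier, Annotation }

// Hypothetical simplified token; ANTLR's IToken carries more information.
public sealed class Tok
{
    public Tok(string text, bool isIdentifier)
    {
        Text = text;
        IsIdentifier = isIdentifier;
    }

    public string Text { get; }
    public bool IsIdentifier { get; }
}

public static class Classifier
{
    // Assigns a kind to every token, peeking one token back and one ahead.
    public static Kind[] Classify(IReadOnlyList<Tok> tokens)
    {
        var kinds = new Kind[tokens.Count];
        for (int i = 0; i < tokens.Count; i++)
        {
            Tok t = tokens[i];
            bool afterAt = i > 0 && tokens[i - 1].Text == "@";
            bool beforeIdentifier = i + 1 < tokens.Count && tokens[i + 1].IsIdentifier;

            if (t.Text == "@" && beforeIdentifier)
                kinds[i] = Kind.Annotation;      // the marker itself
            else if (t.IsIdentifier && afterAt)
                kinds[i] = Kind.Annotation;      // the annotation name
            else if (t.IsIdentifier)
                kinds[i] = Kind.Identifier;
            else
                kinds[i] = Kind.Plain;
        }
        return kinds;
    }
}
```

Because the bounds are checked before each peek, a stray @ at the end of the input (common while the user is still typing) is classified as plain text instead of raising an index error.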

I ended up choosing to use a parser because there were too many ad hoc rules. For example, although I wanted to color regular identifiers white, I wanted types in type declarations (e.g. C in class C) to be green. There ended up being about 20 of these special rules in total. Also, the added overhead of parsing turned out to be minuscule compared to other bottlenecks in my app.
For those interested, you can view my code here: https://github.com/jamesqo/Repository/blob/e5d5653093861bc35f4c0ac71ad6e27265e656f3/Repository.EditorServices/Internal/Java/Highlighting/JavaSyntaxHighlighter.VisitMethods.cs#L19-L76. I've highlighted all of the ~20 special rules I've had to make.

Find a range of text with specific formatting with Word interop

I have a MS Word add-in that needs to extract text from a range of text based solely on its formatting: in my case in particular, if the text is underlined or struck through, the range of characters/words that are underlined or struck through need to be found so that I can keep track of them.
My first idea was to use Range.Find, as is outlined here, but that won't work when I have no idea what the string is that I'm looking for:
var rng = doc.Range(someStartRange, someEndRange);
rng.Find.Forward = true;
rng.Find.Format = true;
// I removed this line in favor of putting it inside Execute()
//rng.Find.Text = "";
rng.Find.Font.Underline = WdUnderline.wdUnderlineSingle;
// this works
rng.Find.Execute("");
int foundNumber = 0;
while (rng.Find.Found)
{
foundNumber++;
// this needed to be added as well, as per the link above
rng.Find.Execute("");
}
MessageBox.Show("Underlined strings found: " + foundNumber.ToString());
I would happily parse the text myself, but am not sure how to do this while still knowing the formatting. Thanks in advance for any ideas.
EDIT:
I changed my code to fix the find underline issue, and with that change the while loop never terminates. More specifically, rng.Find.Found finds the underlined text, but it finds the same text over and over, and never terminates.
EDIT 2:
Once I added the additional Execute() call inside the while loop, the find functioned as needed.

You need
rng.Find.Font.Underline = WdUnderline.wdUnderlineSingle;
(At the moment you are setting the formatting for the specified rng, rather than the formatting for the Find)

How do I use C# to fill out a Word document?

I have a Word document, letter.docx, that is a letter I intend to mail to hundreds of people for a party. The letter is already composed and has been formatted in its own special way with varying type sizes and fonts. It's set and ready to go, with placeholders where I have to fill out variables that change like Name, Address, phone number, etc.
Now, I would like to write a C# program where a user can type in variable things like Name, Address, etc., into a form, hit a button, and produce letter.docx with the right information filled in at the right places.
I understand Word has features that allow you do this, but I really want to do this in C#.

Of course you can do it. Add a reference to Microsoft.Office.Interop.Word to your project.
First, bookmark all the fields you want to update in the document from the 'Insert' tab (e.g. NameField is bookmarked with the tag 'name_field'). Then, in your C# code, add the following:
Microsoft.Office.Interop.Word.Application wordApp = null;
wordApp = new Microsoft.Office.Interop.Word.Application();
wordApp.Visible = true;
Document wordDoc = wordApp.Documents.Open(@"C:\test.docx");
Bookmark bkm = wordDoc.Bookmarks["name_field"];
Microsoft.Office.Interop.Word.Range rng = bkm.Range;
rng.Text = "Adams Laura"; // Get the value from anywhere
Remember to properly save and close the document.

I don't know of anything built into the language, but the example here seems to do exactly what you want.
If you can provide specific examples of what you want to do (are the placeholders Fields? specifically name bits of text?), I can probably give you a more refined answer that directly targets your problem.

Word provides COM objects that you can use from C#.
Add a reference to the Microsoft Office interop library under the COM tab in the Add Reference dialog.
Also, see this question:
Filling in FIelds in work using C#

I had a situation where I needed to fill out some MS Word forms, so I used something similar to the following code (make sure you reference Microsoft.Office.Interop.Word; I used version 14, but you should adjust it to your own scenario):
// FormData is a custom container type that holds data... you'll have your own.
public static void FillOutForm(FormData data)
{
var app = new Microsoft.Office.Interop.Word.Application();
Microsoft.Office.Interop.Word.Document doc = null;
try
{
var filePath = "Your file path.";
doc = app.Documents.Add(filePath);
doc.Activate();
// Loop over the form fields and fill them out.
foreach(Microsoft.Office.Interop.Word.FormField field in doc.FormFields)
{
switch (field.Name)
{
// Text field case.
case "textField1":
field.Range.Text = data.SomeText;
break;
// Check box case.
case "checkBox1":
field.CheckBox.Value = data.IsSomethingTrue;
break;
default:
// Throw an error or do nothing.
break;
}
}
// Save a copy.
var newFilePath = "Your new file path.";
doc.SaveAs2(newFilePath);
}
catch (Exception e)
{
// Perform your error logging and handling here.
}
finally
{
// Make sure you close things out.
// I tend not to save over the original form, so I wouldn't save
// changes to it -- hence the option I chose here.
// doc is still null if Documents.Add threw, so guard the call.
if (doc != null)
{
doc.Close(
Microsoft.Office.Interop.Word.WdSaveOptions.wdDoNotSaveChanges);
}
app.Quit();
}
}
As you can see, it's really not that hard at all. There are some other options on forms, so you'll have to research them, but the most general ones, the check box and the text box, are the ones I demonstrated here. If you didn't create a form, I suggest going through and making sure that you know all the fields, as that's what you'll need for this.

Prevent Word document's fields from updating when opened

I wrote a utility for another team that recursively goes through folders and converts the Word docs found to PDF by using Word Interop with C#.
The problem we're having is that the documents were created with date fields that update to today's date before they get saved out. I found a method to disable updating fields before printing, but I need to prevent the fields from updating on open.
Is that possible? I'd like to do the fix in C#, but if I have to do a Word macro, I can.

As described in Microsoft's endless maze of documentation you can lock the field code. For example in VBA if I have a single date field in the body in the form of
{DATE \# "M/d/yyyy h:mm:ss am/pm" \* MERGEFORMAT }
I can run
ActiveDocument.Fields(1).Locked = True
Then if I make a change to the document, save, then re-open, the field code will not update.
Example using C# Office Interop:
Word.Application wordApp = new Word.Application();
Word.Document wordDoc = wordApp.ActiveDocument;
wordDoc.Fields.Locked = 1; // it's apparently an Int32 rather than a bool
You can place the code in the DocumentOpen event. I'm assuming you have an add-in which subscribes to the event. If not, clarify, as that can be a battle on its own.
EDIT: In my testing, locking fields in this manner locks them across all StoryRanges, so there is no need to get the field instances in headers, footers, footnotes, textboxes, ..., etc. This is a surprising treat.

Well, I didn't find a way to do it with Interop, but my company did buy Aspose.Words and I wrote a utility to convert the Word docs to TIFF images. The Aspose tool won't update fields unless you explicitly tell it to. Here's a sample of the code I used with Aspose. Keep in mind, I had a requirement to convert the Word docs to single page TIFF images and I hard-coded many of the options because it was just a utility for myself on this project.
private static bool ConvertWordToTiff(string inputFilePath, string outputFilePath)
{
try
{
Document doc = new Document(inputFilePath);
for (int i = 0; i < doc.PageCount; i++)
{
ImageSaveOptions options = new ImageSaveOptions(SaveFormat.Tiff);
options.PageIndex = i;
options.PageCount = 1;
options.TiffCompression = TiffCompression.Lzw;
options.Resolution = 200;
options.ImageColorMode = ImageColorMode.BlackAndWhite;
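var extension = Path.GetExtension(outputFilePath);
var pageNum = String.Format("-{0:000}", (i+1));
// Insert the page number before the extension only; String.Replace would also
// rewrite any earlier occurrence of the extension text in the path.
var outputPageFilePath = Path.ChangeExtension(outputFilePath, null) + pageNum + extension;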
doc.Save(outputPageFilePath, options);
}
return true;
}
catch (Exception ex)
{
LogError(ex);
return false;
}
}

I think a new question on SO is appropriate then, because this will require XML processing rather than just Office Interop. If you have both .doc and .docx file types to convert, you might require two separate solutions: one for WordML (Word 2003 XML format), and another for OpenXML (Word 2007/2010/2013 XML format), since you cannot open the old file format and save as the new without the fields updating.
Inspecting the OOXML of a locked field shows us this w:fldLock="1" attribute. This can be inserted using appropriate XML processing against the document, such as through the OOXML SDK, or through a standard XSLT transform.
Might be helpful: this how-do-i-unlock-a-content-control-using-the-openxml-sdk-in-a-word-2010-document question might be a similar situation, but for Content Controls. You may be able to apply the same solution to Fields, if the Lock and LockingValues types apply the same way to fields. I am not certain of this, however.
To give more confidence that this is the way to do it, see example of this vendor's solution for the problem. If you need to develop this in-house, then openxmldeveloper.org is a good place to start - look for Eric White's examples for manipulating fields such as this.
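To make the w:fldLock="1" idea concrete, here is a minimal sketch using LINQ to XML on the main document part's markup. It assumes the XML (word/document.xml in a .docx) has already been extracted and loaded into an XDocument; reading and writing the package itself is left out, and the FieldLocker class name is hypothetical. The attribute is set on w:fldSimple elements and on the w:fldChar element that begins a complex field.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; operates on already-loaded WordprocessingML.
public static class FieldLocker
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Marks every field as locked and returns how many were touched.
    public static int LockFields(XDocument documentXml)
    {
        // Simple fields, plus the "begin" marker of each complex field.
        var targets = documentXml.Descendants(W + "fldSimple")
            .Concat(documentXml.Descendants(W + "fldChar")
                .Where(e => (string)e.Attribute(W + "fldCharType") == "begin"))
            .ToList();

        foreach (XElement element in targets)
        {
            element.SetAttributeValue(W + "fldLock", "1");
        }

        return targets.Count;
    }
}
```

Headers, footers, footnotes and similar parts live in separate XML files inside the package, so a real conversion tool would probably need to run the same step over each of them; that part is not covered by this sketch.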