Is there a way to convert a Microsoft Word document to a string without using the Microsoft COM component? I am hoping there is some other way to deal with all of the excess markup.
EDIT 12/13/13:
We didn't want to reference the COM component because it wouldn't work if the customer didn't have the exact same version of Office installed. Luckily, Microsoft has made the 2013 word.interop.dll backward compatible, so we no longer have to worry about this restriction. After referencing the DLL we can do the following:
using System;
using System.IO;
using Microsoft.Office.Interop.Word;

/// <summary>Gets the content of the Word document</summary>
/// <param name="filePath">The path to the Word document file</param>
/// <returns>The content of the document</returns>
public string ExtractText(string filePath)
{
    if (string.IsNullOrEmpty(filePath))
        throw new ArgumentNullException("filePath", "Input file path not specified.");
    // The second argument of FileNotFoundException is the file that was not found.
    if (!File.Exists(filePath))
        throw new FileNotFoundException("Input file not found at specified path.", filePath);

    var resultText = string.Empty;
    Application wordApp = null;
    try
    {
        wordApp = new Application();
        // Third argument: open the document read-only.
        var doc = wordApp.Documents.Open(filePath, Type.Missing, true);
        if (doc != null)
        {
            if (doc.Content != null && !string.IsNullOrEmpty(doc.Content.Text))
                resultText = doc.Content.Text.Normalize();
            doc.Close();
        }
    }
    finally
    {
        // Always shut Word down, even if opening or reading the document failed.
        if (wordApp != null)
            wordApp.Quit(false, Type.Missing, false);
    }
    return resultText;
}
You will need to use some library to achieve what you want:
MS provides the OpenXML SDK V 2.0 (free, DOCX only)
Aspose.Words (commercial, DOC and DOCX)
If you have lots of time on your hands, then writing a .DOC parser is conceivable; the .DOC spec can be found here.
BTW: Office Interop is not supported by MS in server-like scenarios (such as ASP.NET, a Windows Service or similar); see http://support.microsoft.com/default.aspx?scid=kb;EN-US;q257757#kb2
Assuming you mean extracting the text content of a doc file, there are a few command-line tools as well as commercial libraries. A rather old tool that we once used to search doc (not docx) files (in combination with the search engine sphider) was catdoc (also here), which is a DOS rather than a Windows tool but nonetheless worked for us as long as we met its prerequisites (8.3 file names).
doc2txt is a commercial product, if you can afford $29.
For the newer docx format, you can use the Perl-based tool docx2txt.
Of course, if you want to run those tools from C#, you need to start an external Process; check here for a solid explanation.
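Starting an external tool and capturing what it prints could look roughly like the sketch below. It uses only System.Diagnostics.Process from the framework. The class and method names (ExternalTool, RunAndCapture) are made up for this illustration, and it assumes the tool writes the extracted text to standard output and returns exit code 0 on success.

```csharp
using System;
using System.Diagnostics;

public static class ExternalTool
{
    // Runs a command-line program and returns everything it wrote to standard output.
    // Hypothetical helper for illustration; not part of any library.
    public static string RunAndCapture(string exePath, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = arguments,
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read before WaitForExit so a full output pipe cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "Tool exited with code " + process.ExitCode);
            return output;
        }
    }
}
```

A call might then be ExternalTool.RunAndCapture("catdoc.exe", "REPORT.DOC"), where both the executable location and the file name are placeholders.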
A rather expensive, but very powerful tool to access doc and docx content is Spire.doc, but it does a lot more than you need. It is more convenient to use as it is a .NET library.
If you are referring to the older DOC file format, then that is quite an issue, because it is a binary file format specified by MS, and I must say I totally agree with RQDQ's comment.
But if you are referring to the DOCX file format, then you can achieve this without the MS COM component or any other component, just pure .NET.
Check the following solutions:
http://www.codeproject.com/Articles/20529/Using-DocxToText-to-Extract-Text-from-DOCX-Files
http://www.dotnetspark.com/kb/Content.aspx?id=5633
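For a rough idea of what "just pure .NET" can mean here, the sketch below opens the DOCX as a zip archive and collects the text runs from the main document part. It is not the code from the linked articles. It assumes .NET 4.5 or later (for System.IO.Compression.ZipArchive) and that the main part sits at word/document.xml, which is where Word puts it; the class name DocxText is made up. Headers, footers, footnotes and formatting are ignored, and text inside text boxes may appear twice.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

public static class DocxText
{
    // WordprocessingML main namespace (w:p = paragraph, w:t = text run content).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every paragraph in the main document part, one per line.
    public static string Extract(Stream docxStream)
    {
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, true))
        {
            // Assumption: the main part is stored under its usual name.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found.");

            XDocument xml;
            using (Stream part = entry.Open())
                xml = XDocument.Load(part);

            var sb = new StringBuilder();
            foreach (XElement paragraph in xml.Descendants(W + "p"))
            {
                foreach (XElement text in paragraph.Descendants(W + "t"))
                    sb.Append(text.Value);
                sb.AppendLine();
            }
            return sb.ToString();
        }
    }

    // Convenience overload for a file on disk.
    public static string Extract(string path)
    {
        using (FileStream fs = File.OpenRead(path))
            return Extract(fs);
    }
}
```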
Related
I am trying to add some core properties to the Docx document. I have found only one example in different places of how it can be done.
For instance here. But there is a problem.
If we compare the structure of a docx created by the Word application with one created using OpenXml, there is a difference between them.
Structure of the docx created using OpenXml and document.PackageProperties.Creator = "vso"
Moreover, the file fails validation when I check it with the Productivity Tool from Microsoft. Of course, Word can read this file, but from my point of view this is not a proper way to generate a Word file.
Here you can see the structure of the docx created by the Word application itself
One more aspect: if I write the following:
CoreFilePropertiesPart corePackageProperties = document.CoreFilePropertiesPart;
if (corePackageProperties == null)
{
    corePackageProperties = document.AddCoreFilePropertiesPart();
}
then the core.xml file is created in the proper place in the structure, but it is empty.
So, the question is: does the OpenXML SDK have a way to produce the same docx structure as the Word application itself?
Microsoft documentation suggests:
using (XmlTextWriter writer = new XmlTextWriter(coreFilePropPart.GetStream(FileMode.Create), System.Text.Encoding.UTF8))
{
    // Note: the core-properties namespace URI starts with http://, not https://.
    writer.WriteRaw("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<cp:coreProperties xmlns:cp=\"http://schemas.openxmlformats.org/package/2006/metadata/core-properties\"></cp:coreProperties>");
    writer.Flush();
}
I had the same issue when creating an Excel file and this sorted it out.
I am just starting out with the OpenXML SDK 2.0 in Visual Studio 2010 (C#). I have automated Office programs before using COM automation, which was painful.
I have a template made by one of our graphic designers, which will provide the foundation for my reports. In order to automate the simple things (plaintext items) I have added content controls to the template and bound a custom XML part to the doc. The content controls are as follows:
DayCount
AlternateJobTitle
Date
SignatureName
After making a copy of the template, I then edit the content controls and save the file with the following code:
// stand up object that reads the Word doc package
using (WordprocessingDocument doc = WordprocessingDocument.Open(docOutputPath, true))
{
    // create XML string matching custom XML part
    string newXml = "<root>" +
        "<DayCount>42</DayCount>" +
        "<AlternateJobTitle>Supervisor</AlternateJobTitle>" +
        "<Date>9/24/2012</Date>" +
        "<SignatureName>John Doe</SignatureName>" +
        "</root>";
    MainDocumentPart main = doc.MainDocumentPart;
    main.DeleteParts<CustomXmlPart>(main.CustomXmlParts);
    // add and write new XML part
    CustomXmlPart customXml = main.AddCustomXmlPart(CustomXmlPartType.CustomXml);
    using (StreamWriter ts = new StreamWriter(customXml.GetStream()))
    {
        ts.Write(newXml);
    }
}
This all works well. However, my document is not made up solely of standard text and plaintext updates. The real meat of the report is in a number of tables that need to be added to each report as well. I have been searching like crazy for a good description on how this is done, but have really not found anything. Is there some way to delineate where to place a table using the same content control logic used for plaintext controls? Any code samples I have found of creating a table using OpenXML have just assumed that you want to append it to the end of the main document part. I would like to specify where the tables need to go in the template, generate the tables and place them in the specified regions of the template. Is this possible?
Any help is greatly appreciated.
There are a lot of OpenXml creation questions. If you decide to take this path, the general answer is: examine the Open XML SDK Productivity Tool. On my PC it is at "C:\Program Files (x86)\Open XML SDK\V2.0\tool\OpenXmlSdkTool.exe". Just create in MS Word the document that you want to produce with OpenXml, then use this tool to reflect the code that generates it. Good luck!
If you need to display tabled data, so far, the best thing I found is Word Document Generator at http://worddocgenerator.codeplex.com/.
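One way to place a table where the template says it should go is to put a block-level content control at that spot, give it a tag, and swap the control for a generated table. The sketch below is only an illustration of that idea with the Open XML SDK, not a tested recipe for any particular template. The helper name TablePlaceholder and the tag value you pass in are made up, and it assumes the control is block-level (an SdtBlock wrapping whole paragraphs); an inline control inside a paragraph cannot hold a table.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;

public static class TablePlaceholder
{
    // Replaces the first block-level content control whose Tag equals tagName
    // with a plain table built from rows. Returns false if no such control exists.
    public static bool ReplaceWithTable(Body body, string tagName, string[][] rows)
    {
        SdtBlock placeholder = body.Descendants<SdtBlock>().FirstOrDefault(sdt =>
        {
            var props = sdt.GetFirstChild<SdtProperties>();
            var tag = props == null ? null : props.GetFirstChild<Tag>();
            return tag != null && tag.Val != null && tag.Val.Value == tagName;
        });
        if (placeholder == null)
            return false;

        int columnCount = rows.Length == 0 ? 0 : rows.Max(r => r.Length);

        // Minimal table skeleton: properties, grid, then one row per input array.
        var table = new Table();
        table.AppendChild(new TableProperties(
            new TableWidth { Type = TableWidthUnitValues.Auto, Width = "0" }));
        var grid = new TableGrid();
        for (int c = 0; c < columnCount; c++)
            grid.AppendChild(new GridColumn());
        table.AppendChild(grid);

        foreach (string[] row in rows)
        {
            var tableRow = new TableRow();
            foreach (string cellText in row)
                tableRow.AppendChild(
                    new TableCell(new Paragraph(new Run(new Text(cellText)))));
            table.AppendChild(tableRow);
        }

        // Put the table where the control was, then drop the control itself.
        placeholder.InsertAfterSelf(table);
        placeholder.Remove();
        return true;
    }
}
```

Inside the using block that opens the WordprocessingDocument, you would pass doc.MainDocumentPart.Document.Body to this method and then call doc.MainDocumentPart.Document.Save(). Borders and styling are left out; the Productivity Tool mentioned above can show what Word emits for a styled table.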
I have a request to create a Word document on the fly based on a template provided to me. I have done some research and everything seems to point at OpenXML. I have looked into that, but the .cs file that gets generated is over 15k lines and is breaking my VS 2010 (it stops responding every time I make a change).
I have been looking at this tutorial series on Open XML
http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2011/10/13/getting-started-with-open-xml-development.aspx
I have done things in the past with text files and regular expressions, but since Word encrypts everything, that does not work. Are there any other fairly lightweight options for creating Word documents from templates?
// Hi, it is quite simple.
// First, copy your template file to another location.
string SourcePath = "C:\\MyTemplate.dotx";
string DestPath = "C:\\MyDocument.docx";
System.IO.File.Copy(SourcePath, DestPath);
// After copying the file, open a WordprocessingDocument on the destination path.
using (WordprocessingDocument mydoc = WordprocessingDocument.Open(DestPath, true))
{
    // After opening the document, change its type from template to document.
    mydoc.ChangeDocumentType(WordprocessingDocumentType.Document);
    // If you wish, you can edit the document, for example attach the template to it.
    AttachedTemplate attachedTemplate1 = new AttachedTemplate() { Id = "MyRelationID" };
    MainDocumentPart mainPart = mydoc.MainDocumentPart;
    DocumentSettingsPart MySettingsPart = mainPart.DocumentSettingsPart;
    MySettingsPart.Settings.Append(attachedTemplate1);
    // The external relationship points at the template file.
    MySettingsPart.AddExternalRelationship("http://schemas.openxmlformats.org/officeDocument/2006/relationships/attachedTemplate", new Uri(SourcePath, UriKind.Absolute), "MyRelationID");
    // Finally, save your document.
    mainPart.Document.Save();
}
I am currently working on something along these lines, and I have been making use of the Open XML SDK and the OpenXmlPowerTools. The approach is to take the actual template file, open it, and put text into various placeholders within the template document. I have been using content controls as the place markers.
The SDK tool for opening a document has been invaluable for comparing documents and seeing how they are constructed. However, I have been heavily refactoring the code the tool generates and removing sections that are not used at all.
I can't speak about doc files, but docx files are not encrypted; they are just zip files that contain XML files.
Eric White's blog has a large number of examples and code samples, which have been very useful.
When I try to read a .doc file using the DocumentFormat.OpenXml DLL, it gives the error "File contains corrupted data."
The same DLL reads .docx files properly.
Can the DocumentFormat.OpenXml DLL help in reading .doc files?
string path = @"D:\Data\Test.doc";
string searchKeyWord = @"java";
private bool SearchWordIsMatched(string path, string searchKeyWord)
{
    try
    {
        // Read-only access is enough for searching.
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(path, false))
        {
            var text = wordDoc.MainDocumentPart.Document.InnerText;
            return text.Contains(searchKeyWord);
        }
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace.
        throw;
    }
}
The old .doc files have a completely different format from the new .docx files. So, no, you can't use the OpenXml library to read .doc files.
To do that, you would either need to manually convert the files first, or you would need to use Office interop, instead of the Open XML SDK you're using now.
I'm afraid there won't be any better answer than the ones already given. The Microsoft Word DOC format is binary whereas OpenXML formats such as DOCX are zipped XML files. The OpenXml framework is for working with the latter only.
As suggested, the only other option you have is to use Word interop or a third-party library to convert DOC -> DOCX, which you can then work with using the OpenXml library.
A .doc (if created with an older version of Microsoft Word) does not have the same structure as a .docx (which is basically a zip file with some XML documents).
To probe, rename the .doc extension to .zip and try to unzip it. If the file cannot be unzipped, it is a real binary .doc and you'll have to convert it to a .docx manually.
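Instead of renaming the file, the same probe can be done in code by looking at the first bytes. Zip containers (and therefore .docx) start with the bytes 50 4B 03 04, while the legacy binary Office formats are OLE compound files that start with D0 CF 11 E0 A1 B1 1A E1. The sketch below uses only the base class library; the type names WordFileKind and WordFileSniffer are invented for the example. A match only identifies the container, so a zip signature does not prove the file is a valid DOCX, and the OLE signature is shared by other legacy Office files.

```csharp
using System;
using System.IO;

public enum WordFileKind { Unknown, ZipPackage, OleCompoundFile }

public static class WordFileSniffer
{
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };
    private static readonly byte[] OleMagic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Classifies a stream by its leading signature bytes.
    public static WordFileKind Detect(Stream stream)
    {
        var header = new byte[8];
        int read = 0;
        // Stream.Read may return fewer bytes than requested, so loop until done.
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }

        if (StartsWith(header, read, ZipMagic)) return WordFileKind.ZipPackage;
        if (StartsWith(header, read, OleMagic)) return WordFileKind.OleCompoundFile;
        return WordFileKind.Unknown;
    }

    // Convenience overload for a file on disk.
    public static WordFileKind Detect(string path)
    {
        using (FileStream fs = File.OpenRead(path))
            return Detect(fs);
    }

    private static bool StartsWith(byte[] buffer, int length, byte[] magic)
    {
        if (length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
            if (buffer[i] != magic[i]) return false;
        return true;
    }
}
```

With this, a .doc that reports ZipPackage is most likely a mislabelled .docx that the Open XML SDK may be able to open, while OleCompoundFile means it needs conversion first.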
You can use IFilterTextReader.
string txt;
using (TextReader reader = new FilterReader(path))
{
    txt = reader.ReadToEnd();
}
You can take a look at http://www.codeproject.com/Articles/13391/Using-IFilter-in-C
I have a requirement for an application that takes Doc, Docx and PDF and converts them to RTF.
The conversion is one way and I do not need to convert back to Doc or PDF.
Has anyone done this, and can you recommend a library? I know there is Aspose, but it's way too pricey and the licenses are per year, so that's not going to work for the company I happen to work for.
I'm OK with using a different library for each of the file types if that's what it takes.
Thanks in advance
Telerik has a nice library to do this. They actually have an entire editor that looks like Microsoft Word. It can open multiple file formats and it saves natively as RTF (although it can save as PDF, DOCX, etc.) The one thing I'm not sure of is opening the PDF and saving as an RTF. I'm not sure that the Telerik library can do that.
Here is a link to the library:
http://www.telerik.com/products/wpf/richtextbox.aspx
For a PDF to RTF library, you could use this:
http://www.sautinsoft.com/products/pdf-focus/index.php
GroupDocs.Conversion Cloud is a REST API that converts all common file formats from one format to another reliably and easily. Its free pricing plan offers 50 free credits per month.
Here is sample code for PDF to RTF from default storage:
// Get App Key and App SID from https://dashboard.groupdocs.cloud/
var configuration = new GroupDocs.Conversion.Cloud.Sdk.Client.Configuration(MyAppSid, MyAppKey);
var apiInstance = new ConvertApi(configuration);
try
{
    // convert settings
    var settings = new GroupDocs.Conversion.Cloud.Sdk.Model.ConvertSettings
    {
        StorageName = null,
        FilePath = "02_pages.pdf",
        Format = "rtf",
        ConvertOptions = new RtfConvertOptions(),
        OutputPath = "02_pages.rtf"
    };
    // convert to specified format
    List<StoredConvertedResult> response = apiInstance.ConvertDocument(new ConvertDocumentRequest(settings));
    Console.WriteLine("Document converted successfully: " + response[0].Url);
}
catch (Exception e)
{
    Console.WriteLine("Exception when calling ConvertApi.ConvertDocument: " + e.Message);
}
I'm a developer evangelist at Aspose.