Reading .Doc File using DocumentFormat.OpenXml dll - c#

When I try to read a .doc file using the DocumentFormat.OpenXml dll, it gives the error "File contains corrupted data."
The dll reads .docx files properly.
Can the DocumentFormat.OpenXml dll help in reading .doc files?
string path = @"D:\Data\Test.doc";
string searchKeyWord = @"java";
private bool SearchWordIsMatched(string path, string searchKeyWord)
{
    try
    {
        // Open read-only (false): the document is only searched, not edited.
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(path, false))
        {
            var text = wordDoc.MainDocumentPart.Document.InnerText;
            return text.Contains(searchKeyWord);
        }
    }
    catch (Exception)
    {
        // Rethrow with "throw;" so the original stack trace is preserved.
        throw;
    }
}

The old .doc files have a completely different format from the new .docx files. So, no, you can't use the OpenXml library to read .doc files.
To do that, you would either need to manually convert the files first, or you would need to use Office interop, instead of the Open XML SDK you're using now.

I'm afraid there won't be any better answer than the ones already given. The Microsoft Word DOC format is binary whereas OpenXML formats such as DOCX are zipped XML files. The OpenXml framework is for working with the latter only.
As suggested, the only other option you have is to use Word interop or a third-party library to convert DOC -> DOCX, which you can then work on with the OpenXml library.

A .doc (if created with an older version of Microsoft Word) does not have the same structure as a .docx (which is basically a zip file with some XML documents).
To probe your file, rename the .doc extension to .zip and try to unzip it. If it is not 'unzippable', it is in the old binary format and you'll have to manually convert the .doc to a .docx.
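The rename-to-.zip probe can also be done in code by looking at the first bytes of the file. This is only a minimal sketch: the class and method names (WordFormatSniffer, Detect) are made up for illustration, and the check only tells you the container type. A ZIP signature does not prove the file is a valid .docx, and an OLE compound-file signature does not prove it is a Word .doc.

```csharp
using System;
using System.IO;

// Hypothetical helper: classifies a file by its leading "magic" bytes.
public static class WordFormatSniffer
{
    // First four bytes of an OLE compound file (container of legacy binary .doc).
    private static readonly byte[] OleMagic = { 0xD0, 0xCF, 0x11, 0xE0 };
    // First four bytes of a ZIP local file header (container of .docx).
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    // Returns "zip", "ole" or "unknown".
    public static string Detect(Stream stream)
    {
        var header = new byte[4];
        int read = 0;
        // Stream.Read may return fewer bytes than requested, so loop.
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }
        if (read < header.Length) return "unknown";
        if (StartsWith(header, ZipMagic)) return "zip";
        if (StartsWith(header, OleMagic)) return "ole";
        return "unknown";
    }

    public static string Detect(string path)
    {
        using (var stream = File.OpenRead(path))
            return Detect(stream);
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i]) return false;
        return true;
    }
}
```

A caller could pass only files that come back as "zip" to WordprocessingDocument.Open and route "ole" files to a conversion step first.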

You can use IFilterTextReader.
string txt;
using (TextReader reader = new FilterReader(path))
{
    txt = reader.ReadToEnd();
}
You can take a look at http://www.codeproject.com/Articles/13391/Using-IFilter-in-C

Related

How to find a text in the uploaded PDF file in ASP.NET c#

I want to find whether a text is present in the uploaded PDF file in ASP.NET c#.
using (MemoryStream str = new MemoryStream(this.docUploadField.FileBytes))
{
    using (StreamReader sr = new StreamReader(str, Encoding.UTF8))
    {
        string line = sr.ReadToEnd();
    }
}
I am getting the below as the file content when I read the contents of file.
Please help me with this
You will need a PDF reading library.
The best known are:
iText (formerly iTextSharp, for those who remember it): https://github.com/itext/itext7-dotnet
PdfSharp: https://github.com/empira/PDFsharp
and many other free options.
With these you open the PDF file, read it, and take the text you need.
Usually they give you a collection of the PDF elements (paragraphs, images, etc.), and you loop through them or use a search function to find what you need.

OpenXml. How to add creator using C# in docx?

I am trying to add some core properties to the Docx document. I have found only one example in different places of how it can be done.
For instance here. But there is a problem.
If we look at the structure of the Docx itself created by Word application and using OpenXml, there is a difference between them.
Structure of the docx created using openxml and document.PackageProperties.Creator = "vso"
Moreover, the file fails validation when I check it with the productivity tool from Microsoft. Of course, Word can read this file, but from my point of view this is not a proper way to generate a Word file.
Here you can see the structure of the docx created by the word application itself
One more aspect: if I write the following:
CoreFilePropertiesPart corePackageProperties = document.CoreFilePropertiesPart;
if (corePackageProperties == null)
{
    corePackageProperties = document.AddCoreFilePropertiesPart();
}
then the core.xml file is created in the proper place in the structure, but it is empty.
So, the question is: does the OpenXML SDK have a way to produce the same docx structure as the Word application itself?
The Microsoft documentation suggests:
using (XmlTextWriter writer = new XmlTextWriter(coreFilePropPart.GetStream(FileMode.Create), System.Text.Encoding.UTF8))
{
    // The core-properties namespace URI uses the http scheme, not https.
    writer.WriteRaw("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<cp:coreProperties xmlns:cp=\"http://schemas.openxmlformats.org/package/2006/metadata/core-properties\"></cp:coreProperties>");
    writer.Flush();
}
I had the same issue when creating an Excel file, and this sorted it out.

file data seems to be corrupted when reading file as a string

I'm trying to read a file as string. But it seems that the data is corrupted.
string filepaths = Files[0].FullName;
System.IO.StreamReader myFile = new System.IO.StreamReader(filepaths);
string datas = myFile.ReadToEnd();
But datas contains "pk0101" etc. instead of the original data. I'm doing this so I can replace a placeholder with this string, datas. When I finally do the replacement, the inserted text is 0101 etc. Is it because of the content in datas? How can I read the file as a string? Your help will be greatly appreciated. Thank you.
*.docx is a file format that is really a ZIP package of XML documents, which is why the raw content starts with "PK". Take a look here to become more familiar with this format definition.
For working with Office formats, Microsoft recommends using the Open XML SDK in the DocumentFormat.OpenXml library.
Here is a great article for learning how to work with Word files.
It works as follows:
// Pass the path of the .docx file (filepaths in the question); false opens it read-only.
using (var wordDocument = WordprocessingDocument.Open(filepaths, false))
{
    var body = wordDocument.MainDocumentPart.Document.Body;
    var text = body.GetFirstChild<Paragraph>().InnerText;
}
Also, take a look at this SO question: How do I read data from a word with format using the OpenXML Format SDK with c#?

Read Content From Word Document In C# With Out Use Word Dll

Hi,
I want to get the content of a Microsoft Word file without using the Microsoft.Office.Interop dll.
I tried the code below, but it only reads text from .xml and .txt files, not from .doc files:
using System.IO;
using (StreamReader streamReader = new StreamReader(filePath))
{
    string text = streamReader.ReadToEnd();
}
Office documents are more complex than simple xml/txt files, since they contain much more text-related information (fonts, colors, locations, tables, images, etc.).
Starting with Office 2007, Microsoft uses the 'Office Open XML' format for saving Office files. To parse a docx file, rename its extension to zip (e.g. untitled1.docx.zip) and extract its contents (using any zip app/library).
You will get a few files and folders; navigate to the 'word' folder and simply read the file named 'document.xml'.
This file contains all the textual information of the document (it is XML-formatted, so be sure to parse it correctly).
If you want to extract textual information from pre-2007 files (e.g. a 'doc' file), you will have to use the Microsoft Office Compatibility Pack, which migrates files to the new format (it can be used programmatically; read about it).
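The unzip-and-read approach for .docx can be sketched with nothing but the .NET base class library (System.IO.Compression needs .NET 4.5 or later). The names DocxPlainText and Extract are hypothetical, and the sketch makes simplifying assumptions: it reads only word/document.xml (no headers, footers or footnotes), concatenates the w:t text runs, and starts a new line at each w:p paragraph.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper: pulls plain text out of a .docx without the Open XML SDK or interop.
public static class DocxPlainText
{
    // WordprocessingML main namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string Extract(Stream docxStream)
    {
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; not a .docx package?");

            using (Stream xml = entry.Open())
            {
                // PreserveWhitespace keeps runs that consist only of spaces.
                XDocument doc = XDocument.Load(xml, LoadOptions.PreserveWhitespace);
                var sb = new StringBuilder();
                // Walk all elements in document order.
                foreach (XElement e in doc.Descendants())
                {
                    if (e.Name == W + "p")
                    {
                        // New paragraph: break the line unless nothing was written yet.
                        if (sb.Length > 0) sb.AppendLine();
                    }
                    else if (e.Name == W + "t")
                    {
                        sb.Append(e.Value);
                    }
                }
                return sb.ToString();
            }
        }
    }
}
```

With a file on disk, the call would look like `using (var fs = File.OpenRead(path)) { string text = DocxPlainText.Extract(fs); }`. This does not work for binary .doc files, which are not ZIP packages.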
Add the Namespace using Add Reference-->Browse-->Code7248.word_reader.dll.
Download the dll from the given URL :
sourceforge.net/p/word-reader/wiki/Home/
(A simple .NET Library compatible with .NET 2.0, 3.0, 3.5 and 4.0 for C#. It can currently extract only the raw text from a .doc or .docx file.)
The Sample Code is in simple Console in C#:
using System;
using System.Collections.Generic;
using System.Text;
//add extra namespaces
using Code7248.word_reader;
namespace testWordRead
{
    class Program
    {
        private void readFileContent(string path)
        {
            TextExtractor extractor = new TextExtractor(path);
            string text = extractor.ExtractText();
            Console.WriteLine(text);
        }
        static void Main(string[] args)
        {
            Program cs = new Program();
            // Verbatim string (@) so the backslashes are not treated as escape sequences.
            string path = @"D:\Test\testdoc1.docx";
            cs.readFileContent(path);
            Console.ReadLine();
        }
    }
}
It works fine with doc and docx format files.

How to translate .doc to string?

Is there a way to translate a Microsoft word document to a string without using the Microsoft COM component? I am hoping there is some other way to deal with all of the excess markup.
EDIT 12/13/13:
We didn't want to reference the com component because if the customer didn't have the exact same version of office installed it wouldn't work. Luckily Microsoft has made the 2013 word.interop.dll backward compatible. Now we don't have to worry about this restriction. Once referencing the dll we can do the following:
// Requires: using System; using System.IO; using Microsoft.Office.Interop.Word;
/// <summary>Gets the content of the word document</summary>
/// <param name="filePath">The path to the word document file</param>
/// <returns>The content of the document</returns>
public string ExtractText(string filePath)
{
    if (string.IsNullOrEmpty(filePath))
        throw new ArgumentNullException("filePath", "Input file path not specified.");
    if (!File.Exists(filePath))
        // The second argument is the name of the missing file, so pass the actual path.
        throw new FileNotFoundException("Input file not found at specified path.", filePath);
    var resultText = string.Empty;
    Application wordApp = null;
    try
    {
        wordApp = new Application();
        // Third argument (ReadOnly) is true: the document is opened read-only.
        var doc = wordApp.Documents.Open(filePath, Type.Missing, true);
        if (doc != null)
        {
            if (doc.Content != null && !string.IsNullOrEmpty(doc.Content.Text))
                resultText = doc.Content.Text.Normalize();
            doc.Close();
        }
    }
    finally
    {
        // Always shut Word down, even if opening or reading the document failed.
        if (wordApp != null)
            wordApp.Quit(false, Type.Missing, false);
    }
    return resultText;
}
You will need to use a library to achieve what you want:
Microsoft provides the OpenXML SDK V 2.0 (free, DOCX only)
Aspose.Words (commercial, DOC and DOCX)
If you have lots of time on your hands, writing a .DOC parser is conceivable; the .DOC spec can be found here.
BTW: Office Interop is not supported by Microsoft in server-like scenarios (such as ASP.NET or a Windows Service), see http://support.microsoft.com/default.aspx?scid=kb;EN-US;q257757#kb2
Assuming you mean to extract the text content of a doc file, there are a few command line tools as well as commercial libraries. A rather old tool that we once used to search doc (not docx) files (in combination with the search engine sphider) was catdoc (also here), which is a DOS rather than a Windows tool but nonetheless worked for us as long as we met the prerequisites (8.3 file name format).
A commercial product is doc2txt, if you can afford $29.
For the newer docx format, you can use the Perl-based tool docx2txt.
Of course, if you want to run those tools from C#, you need to start an external Process; check here for a solid explanation.
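Starting a command line tool from C# and capturing what it prints could look roughly like this. It is a sketch, not a definitive implementation: ExternalTool and RunAndCaptureOutput are hypothetical names, and it assumes the tool writes the extracted text to standard output and signals failure with a non-zero exit code.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: runs an external program and returns its standard output.
public static class ExternalTool
{
    public static string RunAndCaptureOutput(string fileName, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = fileName,
            Arguments = arguments,
            UseShellExecute = false,        // required to redirect streams
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read the output before waiting, so a full pipe cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    fileName + " exited with code " + process.ExitCode);
            return output;
        }
    }
}
```

Assuming catdoc is installed and on the PATH, a call might be `ExternalTool.RunAndCaptureOutput("catdoc", "\"" + path + "\"")`. Standard error is not redirected here; a tool that writes a lot to it would need that stream handled as well.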
A rather expensive, but very powerful tool to access doc and docx content is Spire.doc, but it does a lot more than you need. It is more convenient to use as it is a .NET library.
If you are referring to the older DOC file format, then that is quite an issue, because it is a binary file format specified by Microsoft, and I must say I totally agree with RQDQ's comment.
But if you are referring to the DOCX file format, then you can achieve this without the MS COM component or any other component, using just pure .NET.
Check the following solutions:
http://www.codeproject.com/Articles/20529/Using-DocxToText-to-Extract-Text-from-DOCX-Files
http://www.dotnetspark.com/kb/Content.aspx?id=5633
