How to extract text data from MS-Word doc file

I am developing a resume archive where people upload their resumes, and each resume is saved to a specific location. People may use any version of MS Word to prepare their resume, so the file extension could be .doc or .docx. Is there a free library that I can use to extract text from .doc or .docx files, that works for every MS Word version, and that also works when MS Word is not installed on the PC? I searched Google and found some articles on extracting text from .doc files, but I am not sure whether they work for every MS Word version. Please tell me which library I should use to extract text regardless of the MS Word version, and point me to some good articles on this issue.
Also, is there a viewer I can use to show .doc file content from my C# application, regardless of the MS Word version?
Thanks.
I found the answer.
You need to add a reference to Microsoft.Office.Interop.Word.
using System;
using System.IO;

// Reads the text file produced by SaveAsText, then deletes it.
public static string GetText(string strfilename)
{
    System.Text.StringBuilder strBuilder = new System.Text.StringBuilder();
    if (File.Exists(strfilename))
    {
        try
        {
            using (StreamReader sr = File.OpenText(strfilename))
            {
                string s;
                while ((s = sr.ReadLine()) != null)
                {
                    strBuilder.AppendLine(s);
                }
            }
        }
        catch (Exception ex)
        {
            SendErrorMail(ex); // error-reporting helper, not shown here
        }
        finally
        {
            // The text file is only an intermediate copy, so remove it.
            if (File.Exists(strfilename))
                File.Delete(strfilename);
        }
    }
    // Return an empty string when the file held only whitespace.
    string strRetval = strBuilder.ToString();
    return strRetval.Trim() != "" ? strRetval : "";
}
// Opens the Word document and saves a plain-text copy next to it.
// Returns the path of the .txt file.
public static string SaveAsText(string strfilename)
{
    string fileName = "";
    object miss = System.Reflection.Missing.Value;
    Microsoft.Office.Interop.Word.Application wordApp = null;
    Microsoft.Office.Interop.Word.Document doc = null;
    try
    {
        wordApp = new Microsoft.Office.Interop.Word.Application();
        fileName = Path.Combine(Path.GetDirectoryName(strfilename), Path.GetFileNameWithoutExtension(strfilename) + ".txt");
        doc = wordApp.Documents.Open(strfilename, false);
        doc.SaveAs(fileName, Microsoft.Office.Interop.Word.WdSaveFormat.wdFormatDOSText);
    }
    catch (Exception ex)
    {
        SendErrorMail(ex); // error-reporting helper, not shown here
    }
    finally
    {
        if (doc != null)
        {
            doc.Close(ref miss, ref miss, ref miss);
            System.Runtime.InteropServices.Marshal.ReleaseComObject(doc);
            doc = null;
        }
        // Quit Word so that no WINWORD.EXE process is left running.
        if (wordApp != null)
        {
            wordApp.Quit(ref miss, ref miss, ref miss);
            System.Runtime.InteropServices.Marshal.ReleaseComObject(wordApp);
            wordApp = null;
        }
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
    return fileName;
}

See the following:
http://msdn.microsoft.com/en-us/library/cc974107%28office.12%29.aspx
How can i read .docx file?

Use the Microsoft Interop Word NuGet package:
using Microsoft.Office.Interop.Word;

string docPath = @"C:\whereEverTheFileIs.doc";
Application app = new Application();
Document doc = app.Documents.Open(docPath);
string words = doc.Content.Text;
doc.Close();
app.Quit();
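
Both answers above drive Word through interop, so they need Word installed on the machine. For .docx files only, the text can also be read without Word, because a .docx file is a ZIP package whose body is stored as XML. The following is a minimal sketch under stated assumptions, not a complete extractor: the class and method names (DocxTextSketch, ExtractText) are made up for illustration, it assumes the main part sits at the conventional path word/document.xml, and it ignores headers, footers, footnotes, and tabs. It does not handle the older binary .doc format.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper: reads paragraph text from a .docx stream without Word.
public static class DocxTextSketch
{
    // WordprocessingML main namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the document text, one line per paragraph (w:p).
    public static string ExtractText(Stream docxStream)
    {
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main document part is at the conventional path.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("Not a .docx package: word/document.xml is missing.");

            XDocument xml;
            using (Stream part = entry.Open())
                xml = XDocument.Load(part);

            var sb = new StringBuilder();
            foreach (XElement paragraph in xml.Descendants(W + "p"))
            {
                // Concatenate every text run (w:t) inside the paragraph.
                sb.AppendLine(string.Concat(paragraph.Descendants(W + "t").Select(t => t.Value)));
            }
            return sb.ToString();
        }
    }
}
```

A caller would open the uploaded file with File.OpenRead and pass the stream to ExtractText. Files that turn out not to be ZIP packages (for example, old .doc files) make ZipArchive throw InvalidDataException, which can be used to fall back to the interop approach.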

Related

How to transfer information from .docx format to .docm format using .NET Core?

How can I convert from DOCX to DOCM?
This document shows how to convert from DOCM to DOCX:
https://learn.microsoft.com/en-us/office/open-xml/how-to-convert-a-word-processing-document-from-the-docm-to-the-docx-file-format
Can we do the opposite (DOCX to DOCM)?

using System.IO;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;

public void ConvertDOCXtoDOCM(string fileName)
{
    bool fileChanged = false;
    using (WordprocessingDocument document =
        WordprocessingDocument.Open(fileName, true))
    {
        document.ChangeDocumentType(
            WordprocessingDocumentType.MacroEnabledDocument);
        // Track that the document has been changed.
        fileChanged = true;
    }
    // If anything goes wrong in this file handling,
    // the code will raise an exception back to the caller.
    if (fileChanged)
    {
        // Create the new .docm filename.
        var newFileName = Path.ChangeExtension(fileName, ".docm");
        // If it already exists, it will be deleted!
        if (File.Exists(newFileName))
        {
            File.Delete(newFileName);
        }
        // Rename the file.
        File.Move(fileName, newFileName);
    }
}

How to update the title/text of Hyperlink present in Word document using Open XML SDK in C#

Is there any way to update the text/title of an existing hyperlink in a Word document using the Open XML SDK in C#?
I am able to update the address/link of the hyperlinks in the Word document using the code below:
public static Stream GetAndUpdateAllHyperLinksInWordDocumentUsingOpenXMLSDKWithStream(Stream stream)
{
    try
    {
        WordprocessingDocument doc =
            WordprocessingDocument.Open(stream, true);
        MainDocumentPart mainPart = doc.MainDocumentPart;
        //Hyperlink hLink = mainPart.Document.Body.Descendants<Hyperlink>().FirstOrDefault();
        var hLinks = mainPart.Document.Body.Descendants<Hyperlink>();
        Console.WriteLine(hLinks.Count());
        foreach (var hLink in hLinks)
        {
            if (hLink != null)
            {
                // get hyperlink's relation Id (where path stores)
                string relationId = hLink.Id;
                if (relationId != string.Empty)
                {
                    // get current relation
                    var hr = mainPart.HyperlinkRelationships.FirstOrDefault(a => a.Id == relationId);
                    if (hr == null) continue;
                    var linkUrl = hr.Uri.ToString();
                    mainPart.DeleteReferenceRelationship(hr);
                    string linkToReplace = string.Empty;
                    try
                    {
                        //Link to Replace in the word document.
                        linkToReplace = "https://stackoverflow.com/";
                        mainPart.AddHyperlinkRelationship(new System.Uri(linkToReplace, System.UriKind.Absolute), true, relationId);
                    }
                    catch (Exception exc)
                    {
                        Console.WriteLine("Error occurred while adding target url, Error Message: " + exc.Message);
                    }
                }
            }
        }
        doc.Close();
        return stream;
        // return updated document stream
    }
    catch (Exception exc)
    {
        Console.WriteLine("Error Occurred: " + exc.Message);
        return null;
    }
}
I have searched extensively but couldn't find a way to update the title of a hyperlink in a Word document using the Open XML SDK in C#.
Can anybody please suggest how to achieve this?
Thanks in advance.
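
One possible direction, offered as a sketch rather than a confirmed answer: in WordprocessingML the visible text of a hyperlink lives in the Text elements of the runs nested inside the Hyperlink element, separately from the relationship that stores the address. The helper below assumes the Open XML SDK (DocumentFormat.OpenXml) is referenced; the names HyperlinkTextSketch and SetHyperlinkText are hypothetical. It writes the new text into the first Text element of each hyperlink and blanks the rest, which keeps the first run's formatting but discards any per-run formatting differences.

```csharp
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper: replaces the display text of every hyperlink under root.
public static class HyperlinkTextSketch
{
    // Returns the number of hyperlinks whose text was changed.
    public static int SetHyperlinkText(OpenXmlElement root, string newText)
    {
        int updated = 0;
        foreach (Hyperlink link in root.Descendants<Hyperlink>().ToList())
        {
            // The display text may be split across several runs.
            var texts = link.Descendants<Text>().ToList();
            if (texts.Count == 0) continue;

            // Put the whole new text in the first element, blank the others.
            texts[0].Text = newText;
            foreach (Text extra in texts.Skip(1))
                extra.Text = string.Empty;
            updated++;
        }
        return updated;
    }
}
```

In the method from the question, this could be called as HyperlinkTextSketch.SetHyperlinkText(mainPart.Document.Body, "Stack Overflow") before the document is closed, so the change is saved together with the updated relationships.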

docx to html [This command is not available.]

I need to convert a Word document to HTML. I am able to do it with a .doc file,
but when I use a .docx file as input, I receive the error "This command is not available."
Here is my code:
public static string DocToHtml(string path)
{
    try
    {
        //I used Microsoft Interop v12 because this doesn't give me the Access Violation Error.
        Microsoft.Office.Interop.Word.Application _App = new Microsoft.Office.Interop.Word.Application();
        Microsoft.Office.Interop.Word.Document _Doc = _App.Documents.Open(path);
        //Let's save the converted document to the temp folder
        string tempDocx = System.IO.Path.GetTempPath() + "_tempConvertedToHtml.html";
        object _DocxFileName = tempDocx;
        object FileFormat = Microsoft.Office.Interop.Word.WdSaveFormat.wdFormatFilteredHTML;
        _Doc.Convert();
        _Doc.SaveAs(ref _DocxFileName, ref FileFormat);
        //Close the Word interface
        _Doc.Close();
        _App.Quit();
        path = tempDocx;
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.StackTrace);
        throw; // rethrow without resetting the stack trace
    }
    return path;
}

Interop SaveAs bypasses input file extension

I am trying to develop an extension method that uses Excel interop to convert any given input file into a new file, according to an additional XlFileFormat input parameter.
The problem I have found so far is that the SaveAs method ignores any arbitrary extension that I set and chooses the extension according to the XlFileFormat option.
For example:
xlFileFormat = xlCsv, fileName = foo.arbitrary => saves it as foo.arbitrary.csv
xlFileFormat = xlExcel8, fileName = extensionLessFoo => saves it as extensionLessFoo.xls
xlFileFormat = xlOpenXMLWorkbook, fileName = foo.xlsx => saves it as foo.xlsx (this one is OK)
I have been able to work around this by generating random GUID-based file names and passing that name as the SaveAs Filename parameter. Afterwards, I read the FullName of the saved workbook and return a FileInfo for the newly created file.
I would prefer not to depend on temporary files, but to be able to specify both the file name and the extension. So far, neither SaveCopyAs nor SaveAs has given me a proper solution.
This is the method I have developed so far:
public static FileInfo InteropConvertTo(this FileInfo inputFile, XlFileFormat format)
{
    var outputFileName = System.IO.Path.Combine(System.IO.Path.GetTempPath(), "Random SaveAs File -" + System.Guid.NewGuid().ToString());
    var outputFile = new FileInfo(outputFileName);
    try
    {
        // Create a new, silent application.
        var hiddenApp = new Application();
        hiddenApp.Visible = false;
        hiddenApp.ScreenUpdating = false;
        hiddenApp.DisplayAlerts = false;
        // Add the workbook, save it in the new format, then close.
        var inputWorkbook = hiddenApp.Workbooks.Add(inputFile.FullName);
        inputWorkbook.DoNotPromptForConvert = true;
        inputWorkbook.SaveAs(Filename: outputFileName,
            FileFormat: format, AccessMode: XlSaveAsAccessMode.xlNoChange, CreateBackup: false);
        // SaveAs may append its own extension, so read back the real path.
        outputFile = new FileInfo(inputWorkbook.FullName);
        outputFile.IsReadOnly = false;
        inputWorkbook.Close(false);
        hiddenApp.Quit();
        releaseObject(inputWorkbook);
        releaseObject(hiddenApp);
    }
    finally
    {
        GC.Collect();
    }
    return outputFile;
}
private static void releaseObject(object obj)
{
    try
    {
        System.Runtime.InteropServices.Marshal.ReleaseComObject(obj);
    }
    catch (Exception)
    {
        // Ignore failures while releasing the COM object.
    }
    finally
    {
        obj = null;
        GC.Collect();
    }
}
Is there any way to make SaveAs use my own output file extension?
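
This sketch does not make SaveAs itself honor an arbitrary extension. It only illustrates one workaround under stated assumptions: let SaveAs write whatever name it chooses, then rename that file to the exact name wanted once the workbook is closed. The names SaveAsRenameSketch and MoveToExactName are hypothetical, and only System.IO is used.

```csharp
using System;
using System.IO;

// Hypothetical helper: renames the file SaveAs really wrote to the desired name.
public static class SaveAsRenameSketch
{
    public static FileInfo MoveToExactName(string actualPath, string desiredPath)
    {
        if (!File.Exists(actualPath))
            throw new FileNotFoundException("Saved file not found.", actualPath);

        // Nothing to do when SaveAs already produced the desired name.
        if (string.Equals(Path.GetFullPath(actualPath), Path.GetFullPath(desiredPath),
                          StringComparison.OrdinalIgnoreCase))
            return new FileInfo(actualPath);

        // Overwrite any previous output with the same name.
        if (File.Exists(desiredPath))
            File.Delete(desiredPath);

        File.Move(actualPath, desiredPath);
        return new FileInfo(desiredPath);
    }
}
```

In InteropConvertTo, the path to pass as actualPath would be the value read from inputWorkbook.FullName after SaveAs. The move has to happen after the workbook is closed, because Excel keeps the saved file open until then.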

Excel 2003 VSTO convert to PDF

I have an Excel workbook VSTO solution that needs to generate a PDF copy of one of its sheets as output.
I have a license for ABCpdf .NET and tried saving to HTML and then using ABCpdf to convert the HTML to PDF, but Excel's HTML output tries to emulate Excel, with all 4 worksheets and horrible markup. It also messes up the colors (silver background across the entire workbook).
Any suggestions?
Here is the code I'm currently using to generate the HTML file:
FileInfo excelDoc = new FileInfo(Globals.ThisWorkbook.Path + @"\Document.html");
Globals.Sheet2.SaveAs(excelDoc.FullName,
    Excel.XlFileFormat.xlHtml, missing, missing, false, false,
    Excel.XlSaveAsAccessMode.xlNoChange,
    missing, missing, missing);
If I hack away some of the HTML header tags manually, I can get ABCpdf to accept it, but the formatting is a bit off and this solution seems suboptimal.
Thanks in advance.

Solution found: save the Excel sheet as an XPS printout, then import the XPS printout into a PDF document.
The MyImportOperation code is adapted from the ABCpdf XPS sample source code.
public void SaveSheetToPdf(FileInfo outputPDF)
{
    FileInfo documentFile = new FileInfo(Globals.ThisWorkbook.Path + @"\tempDoc.xps");
    if (documentFile.Exists)
        documentFile.Delete();
    // Print the sheet to a file through the XPS document writer.
    Globals.Sheet2.PrintOut(1, missing, 1, false, "Microsoft XPS Document Writer", true, false, documentFile.FullName);
    Doc theDoc = new Doc();
    try
    {
        MyImportOperation importOp = new MyImportOperation(theDoc);
        importOp.Import(documentFile.FullName);
    }
    catch (Exception ex)
    {
        throw new Exception("Error rendering pdf. PDF Source XPS Path: " + documentFile.FullName, ex);
    }
    theDoc.Save(outputPDF.FullName);
}
public class MyImportOperation
{
    private Doc _doc = null;
    private double _margin = 10;
    private int _pagesAdded = 0;
    public MyImportOperation(Doc doc)
    {
        _doc = doc;
    }
    public void Import(string inPath)
    {
        using (XpsImportOperation op = new XpsImportOperation())
        {
            op.ProcessingObject += Processing;
            op.ProcessedObject += Processed;
            op.Import(_doc, inPath);
        }
    }
    public void Processing(object sender, ProcessingObjectEventArgs e)
    {
        // Add a new PDF page for each XPS page being imported.
        if (e.Info.SourceType == ProcessingSourceType.PageContent)
        {
            _doc.Page = _doc.AddPage();
            e.Info.Handled = true;
            _pagesAdded++;
        }
    }
    public void Processed(object sender, ProcessedObjectEventArgs e)
    {
        // Compress imported images to keep the PDF small.
        if (e.Successful)
        {
            PixMap pixmap = e.Object as PixMap;
            if (pixmap != null)
                pixmap.Compress();
        }
    }
}
