Universal Microsoft Word document parser without Word installation

I need to parse Microsoft Word 97/2003 (.doc) and Microsoft Word 2007/2010 (.docx) files using C# and WPF on a machine without Word installed. Can someone recommend a reliable library for this?
Technically, I iterate through the ZipEntry elements like this:
foreach (string file in _listPathFiles)
{
    using (Ionic.Zip.ZipFile zip = ZipFile.Read(file))
    {
        try
        {
            zip.ToList().ForEach(entry =>
            {
                if (entry.FileName.EndsWith(".doc") ||
                    entry.FileName.EndsWith(".docx"))
                {
                    // Extract file to disk
                    entry.FileName = System.IO.Path.GetFileName(entry.FileName);
                    entry.Extract(baseStoragePath);
                    // Get data from file with Parser
                    string filePath = baseStoragePath + entry.FileName;
                    // Remove extracted files
                    if (File.Exists(filePath))
                    {
                        File.Delete(filePath);
                        Console.WriteLine("Delete : " + filePath);
                    }
                }
            });
        }
        catch (Exception e)
        {
            Console.WriteLine("Fail to unzip Exception : " + e.StackTrace);
        }
    }
}
I'm not sure whether I can use the ZipEntry directly to get the document; maybe I will have to unzip it before parsing.
My goal is to get the data located after a paragraph with the "Heading 1" Microsoft Word style, so the library should be able to read this kind of property.
Library suggestions and code samples are welcome.

GroupDocs.Parser for .NET can be used in your case for extracting the text from Word documents without installing MS Word. The extraction can be performed line by line or at once.
// extracting all the text
WordsTextExtractor extractor = new WordsTextExtractor("sample.docx");
Console.Write(extractor.ExtractAll());
// OR
// Extract text line by line
string line = extractor.ExtractLine();
// If the line is null, then the end of the file is reached
while (line != null)
{
// Print a line to the console
Console.Write(line);
// Extract another line
line = extractor.ExtractLine();
}
Disclosure: I work as Developer Evangelist at GroupDocs.

Check out NPOI (a .NET port of the Apache POI API):
http://npoi.codeplex.com/
or
download the Open XML SDK for reading Office documents such as MS Word files.
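As a rough illustration of the Open XML SDK route for the "text after Heading 1" goal, here is a minimal sketch. It assumes the DocumentFormat.OpenXml NuGet package is referenced and that the heading paragraphs use the style ID "Heading1" (typical for documents created with an English Word, but not guaranteed). The class and method names are hypothetical, and the approach covers .docx only, not the binary .doc format.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class HeadingReader
{
    // Returns the text of every top-level body paragraph that follows the
    // first paragraph whose style ID equals headingStyleId.
    // "Heading1" is an assumption; check styles.xml for the actual ID.
    public static List<string> TextAfterHeading(string docxPath, string headingStyleId = "Heading1")
    {
        var result = new List<string>();
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;
            bool afterHeading = false;
            foreach (Paragraph p in body.Elements<Paragraph>())
            {
                string styleId = p.ParagraphProperties?.ParagraphStyleId?.Val?.Value;
                if (styleId == headingStyleId)
                {
                    afterHeading = true;   // start collecting from the next paragraph
                    continue;
                }
                if (afterHeading)
                {
                    result.Add(p.InnerText);
                }
            }
        }
        return result;
    }
}
```

Because the document is opened from a path (a Stream overload of Open also exists), the ZIP entry from the question would still need to be extracted to disk or copied into a MemoryStream first.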

Related

How to solve the error of Word opening in background when trying to read text from Word documents?

I'm trying to read the text of Word documents into a List<string> and then search that text for a word. The problem, however, is that Word keeps running in the Windows background for each opened document, even though I close the document after reading the text.
Parallel.ForEach(files, file =>
{
switch (System.IO.Path.GetExtension(file))
{
case ".docx":
List<string> Word_list = GetTextFromWord(file);
SearchForWordContent(Word_list, file);
break;
}
});
static List<string> GetTextFromWord(string direct)
{
if (string.IsNullOrEmpty(direct))
{
throw new ArgumentNullException("direct");
}
if (!File.Exists(direct))
{
throw new FileNotFoundException("direct");
}
List<string> word_List = new List<string>();
try
{
Microsoft.Office.Interop.Word.Application app =
new Microsoft.Office.Interop.Word.Application();
Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(direct);
int count = doc.Words.Count;
for (int i = 1; i <= count; i++)
{
word_List.Add(doc.Words[i].Text);
}
((_Application)app).Quit();
}
catch (System.Runtime.InteropServices.COMException e)
{
Console.WriteLine("Error: " + e.Message.ToString());
}
return word_List;
}

When you use Word Interop you're actually starting the Word application and talking to it using COM. Every call, even reading a property, is an expensive cross-process call.
You can read a Word document without using Word. A docx document is a ZIP package containing well-defined XML files. You could read those files as XML directly, you can use the Open XML SDK to read a docx file or use a library like NPOI which simplifies working with Open XML.
The word count is a property of the document itself. To read it using the Open XML SDK you need to check the document's ExtendedFileProperties part :
using (var document = WordprocessingDocument.Open(fileName, false))
{
    // Words.Text is a string, so it has to be parsed rather than cast
    var words = int.Parse(document.ExtendedFilePropertiesPart.Properties.Words.Text);
}
You'll find the Open XML documentation, including the structure of Word documents, at MSDN.
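The "read those files as XML directly" option mentioned above can be sketched with nothing but the base class library. This is a minimal example under some assumptions: the main part is stored at word/document.xml (the usual location, although the package relationships formally decide it), and the class name DocxText is made up for illustration. Paragraphs nested inside text boxes may be reported twice.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

static class DocxText
{
    // Main WordprocessingML namespace used by word/document.xml.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each paragraph (w:p) by joining its text runs (w:t).
    public static List<string> ReadParagraphs(string docxPath)
    {
        using (FileStream stream = File.OpenRead(docxPath))
        using (var zip = new ZipArchive(stream, ZipArchiveMode.Read))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("No word/document.xml part found.");
            }

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);
                return xml.Descendants(W + "p")
                          .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                          .ToList();
            }
        }
    }
}
```

No Word process is started, so nothing is left running in the background, and the method is safe to call from Parallel.ForEach because each call works on its own stream.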
Avoiding Owner Files
Word or Excel Files that start with ~ are owner files. These aren't real Word or Excel files. They're temporary files created when someone opens a document for editing and contain the logon name of that user. These files are deleted when Word closes gracefully but may be left behind if Word crashes or the user has no DELETE permissions, eg in a shared folder.
To avoid these one only needs to check whether the filename starts with ~.
If the fileName is only the file name and extension, fileName.StartsWith("~") is enough
If fileName is an absolute path, use Path.GetFileName(fileName).StartsWith("~")
Things get trickier when trying to filter such files in a folder. The patterns used in Directory.EnumerateFiles or DirectoryInfo.EnumerateFiles are simplistic and can't exclude characters. The files will have to be filtered after the call to EnumerateFiles, eg :
var dir=new DirectoryInfo(folderPath);
foreach(var file in dir.EnumerateFiles("*.docx"))
{
if (!file.Name.StartsWith("~"))
{
...
}
}
or, using LINQ :
var dir=new DirectoryInfo(folderPath);
var files=dir.EnumerateFiles("*.docx")
.Where(file=>!file.Name.StartsWith("~"));
foreach(var file in files)
{
...
}
Enumeration can still fail if a file or folder is inaccessible, for example when it is opened exclusively for editing or access is denied. To avoid exceptions, the EnumerationOptions.IgnoreInaccessible property (available in .NET Core 2.1 and later) can be set to skip such entries:
var dir=new DirectoryInfo(folderPath);
var options=new EnumerationOptions
{
IgnoreInaccessible =true
};
var files=dir.EnumerateFiles("*.docx",options)
.Where(file=>!file.Name.StartsWith("~"));

How can I debug my add-in when WordEditor's Content property is crashing Outlook?

I have a folder full of *.msg files saved from Outlook and I'm trying to convert them to Word.
There is a loop that loads each *.msg as a MailItem and saves it.
public static ConversionResult ConvertEmailsToWord(this Outlook.Application app, string source, string target)
{
var word = new Word.Application();
var emailCounter = 0;
var otherCounter = 0;
var directoryTree = new PhysicalDirectoryTree();
foreach (var node in directoryTree.Walk(source))
{
foreach (var fileName in node.FileNames)
{
var currentFile = Path.Combine(node.DirectoryName, fileName);
var branch = Regex.Replace(node.DirectoryName, $"^{Regex.Escape(source)}", string.Empty).Trim('\\');
Debug.Print($"Processing file: {currentFile}");
// This is an email. Convert it to Word.
if (Regex.IsMatch(fileName, @"\.msg$"))
{
if (app.Session.OpenSharedItem(currentFile) is MailItem item)
{
if (item.SaveAs(word, Path.Combine(target, branch), fileName))
{
emailCounter++;
}
item.Close(SaveMode: OlInspectorClose.olDiscard);
}
}
// This is some other file. Copy it as is.
else
{
Directory.CreateDirectory(Path.Combine(target, branch));
File.Copy(currentFile, Path.Combine(target, branch, fileName), true);
otherCounter++;
}
}
}
word.Quit(SaveChanges: false);
return new ConversionResult
{
EmailCount = emailCounter,
OtherCount = otherCounter
};
}
The save method looks like this:
public static bool SaveAs(this MailItem mail, Word.Application word, string path, string name)
{
Directory.CreateDirectory(path);
name = Path.Combine(path, $"{Path.GetFileNameWithoutExtension(name)}.docx");
if (File.Exists(name))
{
return false;
}
var copy = mail.GetInspector.WordEditor as Word.Document;
copy.Content.Copy();
var doc = word.Documents.Add();
doc.Content.Paste();
doc.SaveAs2(FileName: name);
doc.Close();
return true;
}
It works for most *.msg files, but some of them crash Outlook when I call copy.Content on the Word.Document.
I know you probably cannot tell me what is wrong with them (or maybe you can?), so I'd like to find out by myself. The problem is that I am not able to catch the exception. Since a simple try/catch didn't work, I tried AppDomain.CurrentDomain.UnhandledException, but this didn't catch it either.
Are there any other ways to debug it?
A mail that doesn't let me get its content inside the loop doesn't cause any trouble when I open it in a new Outlook window and save it with the same method.

It makes sense to add some delays between Word calls, because IO operations take some time to finish. Also, there is no need to create another document in Word for copying the content:
var copy = mail.GetInspector.WordEditor as Word.Document;
copy.Content.Copy();
var doc = word.Documents.Add();
doc.Content.Paste();
doc.SaveAs2(FileName: name);
doc.Close();
Instead, make the required modifications on the original document instance and then save it to disk. The original mail item will remain unchanged until you call the Save method from the Outlook object model. You may call the Close method, passing olDiscard, to discard any changes to the document.
Also consider using the Open XML SDK if you deal with Open XML documents only; see Welcome to the Open XML SDK 2.5 for Office for more information.

Do you actually need to use Inspector.WordEditor? You can save the message in a format supported by Word (such as MHTML) using OOM alone by calling MailItem.SaveAs(..., olMHTML), and then open the file in Word programmatically to save it in the DOCX format.
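A minimal sketch of that MHTML round trip is shown below. It assumes references to the Outlook and Word interop assemblies; the class name, method name, and temporary-file handling are illustrative and not part of the original answer, and the code has not been run against the problematic messages.

```csharp
using System;
using System.IO;
using Outlook = Microsoft.Office.Interop.Outlook;
using Word = Microsoft.Office.Interop.Word;

static class MailToDocx
{
    // Saves the mail as MHTML with the Outlook object model, then lets Word
    // convert that file to DOCX. WordEditor and the clipboard are not used.
    public static void SaveAsDocx(Outlook.MailItem mail, Word.Application word, string docxPath)
    {
        // Hypothetical temp file name; any writable location works.
        string mhtPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".mht");
        try
        {
            mail.SaveAs(mhtPath, Outlook.OlSaveAsType.olMHTML);

            Word.Document doc = word.Documents.Open(
                FileName: mhtPath, ConfirmConversions: false, ReadOnly: true);
            doc.SaveAs2(FileName: docxPath, FileFormat: Word.WdSaveFormat.wdFormatXMLDocument);
            doc.Close(SaveChanges: false);
        }
        finally
        {
            if (File.Exists(mhtPath))
            {
                File.Delete(mhtPath);
            }
        }
    }
}
```

Since no inspector is opened for the item, this avoids the WordEditor call that crashes Outlook in the question, at the cost of one temporary file per message.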

How to delete the last character of a file with C#

Hello, I'm a beginner with C# and I want to delete the last character of my file so that I can append JSON objects to it manually (I know that's not the best way to do it) and end up with the right format. I tried several approaches, such as opening the file, manipulating the string (deleting the last character) and writing the text back to the same file, but I get errors like IOException: The process cannot access the file 'file path' because it is being used by another process, or System.UnauthorizedAccessException: 'Access to the path 'C:\Users\ASUS\Desktop\Root' is denied.
Here is the code:
StoreLogs Log = new StoreLogs()
{
Id = ID,
DateTime = dateT,
TaskName = task,
SrcAddress = srcPath,
DstAddress = path,
FileSize = DirSize(new DirectoryInfo(srcPath)),
DelayTransfer = ts.Milliseconds,
};
// Record JSON data in the variable
string strResultJson = JsonConvert.SerializeObject(Log);
// Show the JSON Data
// Console.WriteLine(strResultJson);
// Write JSON Data in another file
string MyJSON = null;
string strPath = @"C:\Users\ASUS\Desktop\Backup\logs\log.json";
if (File.Exists(strPath))
{
//FileInfo table = new FileInfo(strPath);
//string strTable = table.OpenText().ReadToEnd();
//string erase = strTable.Remove(strTable.LastIndexOf(']'));
//Console.WriteLine(erase);
//StreamReader r1 = new StreamReader(strPath);
//string strTable = r1.OpenText().ReadToEnd();
//string erase = strTable.Remove(strTable.LastIndexOf(']'));
//r1.Close();
using (StreamReader sr = File.OpenText(strPath))
{
string table = sr.ReadToEnd();
string erase = table.Remove(table.LastIndexOf(']'));
sr.Close();
File.WriteAllText(strPath, erase);
}
//MyJSON = "," + strResultJson;
//File.AppendAllText(strPath, MyJSON + "]");
//Console.WriteLine("The file exists.");
}
else if (!File.Exists(strPath))
{
MyJSON = "[" + strResultJson + "]";
File.WriteAllText(strPath, MyJSON);
Console.WriteLine("The file doesn't exists.");
}
else
{
Console.WriteLine("Error");
}
// End
Console.WriteLine("JSON Object generated !");
Console.ReadLine();
And that's the result I want :
[{"Id":"8484","DateTime":"26 novembre 2019 02:33:35 ","TaskName":"dezuhduzhd","SrcAddress":"C:\\Users\\ASUS\\Desktop\\Root","DstAddress":"C:\\Users\\ASUS\\Desktop\\Backup","FileSize":7997832.0,"DelayTransfer":0.0},{"Id":"8484","DateTime":"26 novembre 2019 02:33:35 ","TaskName":"dezuhduzhd","SrcAddress":"C:\\Users\\ASUS\\Desktop\\Root","DstAddress":"C:\\Users\\ASUS\\Desktop\\Backup","FileSize":7997832.0,"DelayTransfer":0.0},{"Id":"8484","DateTime":"26 novembre 2019 02:33:35 ","TaskName":"dezuhduzhd","SrcAddress":"C:\\Users\\ASUS\\Desktop\\Root","DstAddress":"C:\\Users\\ASUS\\Desktop\\Backup","FileSize":7997832.0,"DelayTransfer":0.0}]
Edit :
Thank you all for your advice.
Solution:
FileStream fs = new FileStream(strPath, FileMode.Open, FileAccess.ReadWrite);
fs.SetLength(fs.Length - 1);
fs.Close();

In the code example you have posted you are opening a stream to read the file. A using block will dispose the stream after you exit the block. You are trying to write to the file, while the read stream is still accessing it (the read stream still exists). You've basically opened the file, you read from it, and are trying to write back to it while still holding it open. The reason this is a problem is that you are not using the stream to write. So your second, write, process is unable to access the file. I see you are closing the stream prior to write, but I'm willing to bet it's still holding the reference open.
I would try this method:
How to both read and write a file in C#
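For illustration, a single read/write FileStream can both drop the closing bracket and append the next object, so the file is never opened twice. The sketch below makes several assumptions: the file is UTF-8 without a byte order mark, it has no trailing whitespace (so its last byte is ']'), and the JsonArrayLog name is made up for the example.

```csharp
using System.IO;
using System.Text;

static class JsonArrayLog
{
    // Appends one already-serialized JSON object to a file holding a JSON array.
    public static void Append(string path, string jsonObject)
    {
        if (!File.Exists(path))
        {
            File.WriteAllText(path, "[" + jsonObject + "]");
            return;
        }

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite, FileShare.None))
        {
            if (fs.Length < 2)
            {
                throw new InvalidDataException("File does not contain a JSON array.");
            }

            // Check that the last byte really is the closing bracket.
            fs.Seek(-1, SeekOrigin.End);
            if (fs.ReadByte() != ']')
            {
                throw new InvalidDataException("File does not end with ']'.");
            }

            bool arrayIsEmpty = fs.Length == 2;   // the file is exactly "[]"
            fs.SetLength(fs.Length - 1);          // drop the trailing ']'
            fs.Seek(0, SeekOrigin.End);

            byte[] tail = Encoding.UTF8.GetBytes((arrayIsEmpty ? "" : ",") + jsonObject + "]");
            fs.Write(tail, 0, tail.Length);
        }
    }
}
```

A call such as JsonArrayLog.Append(strPath, strResultJson) would then replace the whole if/else block from the question.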

The message says that access to the path (C:\Users\ASUS\Desktop\Root) is denied for the user who is running the application. For example, if you are running from Visual Studio under the user1 Windows login, then user1 should have appropriate rights to that Root folder. If the code runs by itself (as an exe), check the access rights of the user who starts that exe.

Based on the errors you posted, it seems that:
Maybe you're leaving a stream open that points to the file you want to edit; use the 'using' statement to avoid this (see this link for more info).
You're trying to access a file without the needed permissions (you aren't a system admin or the file is read-only); try changing the file location or making the file writable (see this link for more info about the UnauthorizedAccessException exception).
Hope this helps you!

How can I send docx and pdf file to printer?

I have the following code, based on this article, which prints a text file from C# to a printer. It prints plain text perfectly, but when I try to print a .docx or .pdf file, it prints the content as some kind of encoded characters. How can I fix this to print PDF and Word files?
private void btnPrint_Click(object sender, EventArgs e)
{
// Select the desired printer. ps.Duplex = Duplex.Simplex; // This works
pdocFile.PrinterSettings.PrinterName = cboPrinter.Text;
pdocFile.PrinterSettings.Duplex = Duplex.Horizontal;
// Print the checked files.
foreach (string filename in clbFiles.CheckedItems)
{
Console.WriteLine("Printing: " + filename);
// Get the file's name without the path.
FileInfo file_into = new FileInfo(filename);
string short_name = file_into.Name;
// Set the PrintDocument's name for use by the printer queue.
pdocFile.DocumentName = short_name;
// Read the file's contents.
try
{
FileContents = File.ReadAllText(filename).Trim();
}
catch (Exception ex)
{
MessageBox.Show("Error reading file " + filename +
".\n" + ex.Message);
return;
}
// Print.
pdocFile.Print();
}
MessageBox.Show("Spooled " + clbFiles.CheckedItems.Count +
" files for printing.");
}
//
private string FileContents;
// Print a page of the text file.
private void pdocTextFile_PrintPage(object sender, PrintPageEventArgs e)
{
// Make a font for printing.
using (Font font = new Font("Courier New", 10))
{
// Make a StringFormat to align text normally.
using (StringFormat string_format = new StringFormat())
{
// See how much of the remaining text will fit.
SizeF layout_area = new SizeF(e.MarginBounds.Width, e.MarginBounds.Height);
int chars_fitted, lines_filled;
e.Graphics.MeasureString(FileContents, font,
layout_area, string_format,
out chars_fitted, out lines_filled);
// Print as much as will fit.
e.Graphics.DrawString(
FileContents.Substring(0, chars_fitted),
font, Brushes.Black, e.MarginBounds,
string_format);
// Remove the printed text from the string.
FileContents = FileContents.Substring(chars_fitted).Trim();
}
}
// See if we are done.
e.HasMorePages = FileContents.Length > 0;
}

Your example above is taking a binary file format and trying to print it using a method that uses plain text, which will not work. You have a few options on how you could approach this.
Some printers allow you to submit various file types directly to them over a protocol like FTP. An example of this can be seen here. This method works great in enterprise environments that have business printers, but it is limited to the file types supported by each printer and to each printer's unique requirements.
For some formats, you can use third-party libraries like iText in your C# code to handle the actual printing. This option gives you a ton of control over the formatting, with the overhead of having to maintain additional code for every file type you wish to support.
You can also use the example code posted here to utilize already installed applications. In this example, it takes advantage of the Print verb made available by Adobe Acrobat, Word, etc. You'll need to make sure the applications have their defaults and surface the correct verb (which typically correlates with the context menu when right-clicking on a file name). This method is probably the most straight-forward option to handle files as-is.
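The "Print verb" option from the last paragraph can be sketched roughly as follows. This is an assumption-laden example rather than the linked code: it is Windows-only, prints to the default printer, and only works when the application registered for the extension exposes a "print" verb. The class name and the timeout value are made up.

```csharp
using System.Diagnostics;

static class ShellPrinter
{
    // Asks the application associated with the file type (Word, a PDF reader,
    // and so on) to print the file to the default printer.
    public static void Print(string filePath, int timeoutMs = 30000)
    {
        var startInfo = new ProcessStartInfo(filePath)
        {
            Verb = "print",            // shell verb, as in the Explorer context menu
            UseShellExecute = true,    // required for verbs to be honoured
            CreateNoWindow = true,
            WindowStyle = ProcessWindowStyle.Hidden
        };

        // Process.Start can return null when an existing process is reused.
        using (Process process = Process.Start(startInfo))
        {
            process?.WaitForExit(timeoutMs);
        }
    }
}
```

In the btnPrint_Click loop from the question, .docx and .pdf files could be routed to ShellPrinter.Print(filename) while plain text files keep using the PrintDocument path.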

OpenXML documents, how do you know which is which, when there is no extension

What I have done for now, and which works, is this:
private string DetermineOpenXML(string file)
{
try
{
SpreadsheetDocument doc = SpreadsheetDocument.Open(file, false);
doc.Close();
return ".xlsx";
}
catch
{
try
{
WordprocessingDocument doc = WordprocessingDocument.Open(file, false);
doc.Close();
return ".docx";
}
catch
{
try
{
PresentationDocument doc = PresentationDocument.Open(file, false);
doc.Close();
return ".pptx";
}
catch
{
return string.Empty;
}
}
}
}
I think there must be a better way to see what kind of file it is than trial and error. I am working on a small program that finds out which extension a file should have. The reason is that I get the files from a database, where they are sometimes saved without an extension and other times with a wrong extension.
I found out that all of these OpenXML documents share the same file signature, "50 4B 03 04 14 00 06 00", which is close to the signature of a ZIP file, and I can also open OpenXML files with a ZIP program and see their content. Maybe this is the solution I should go for; I was just hoping that it would be faster and easier to use the OpenXML SDK, and that it had a property or something similar that could check it for me.
Edit:
I have added an answer. I would still like to see if there is a better solution, even though it works for my current purpose. It does not take into account that some of the files may be templates and should get a template extension.
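The signature check described above can be sketched as a cheap pre-filter before opening the package. Only the first four bytes (50 4B 03 04) are the ZIP local file header signature; the next four are the version and flag fields and can differ between files, so the sketch compares only the first four. The class and method names are hypothetical.

```csharp
using System.IO;
using System.Linq;

static class ZipSignature
{
    // "PK\x03\x04": signature of a ZIP local file header.
    static readonly byte[] LocalFileHeader = { 0x50, 0x4B, 0x03, 0x04 };

    // Returns true when the file starts with the ZIP signature. This only says
    // the file may be a ZIP container; it does not tell docx, xlsx and pptx apart.
    public static bool LooksLikeZip(string path)
    {
        var buffer = new byte[4];
        using (FileStream fs = File.OpenRead(path))
        {
            int read = fs.Read(buffer, 0, buffer.Length);
            return read == buffer.Length && buffer.SequenceEqual(LocalFileHeader);
        }
    }
}
```

Files that fail this check can be skipped without paying for an exception, and files that pass still need a look inside the package to decide between Word, Excel and PowerPoint.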

I ended up using System.IO.Packaging instead.
private string anotherOpenXmlAttempt(string file)
{
string ext = string.Empty;
Package package = Package.Open(file);
if (package.PartExists(new Uri("/word/document.xml", UriKind.Relative)))
{
ext = ".docx";
}
else if (package.PartExists(new Uri("/xl/workbook.xml", UriKind.Relative)))
{
ext = ".xlsx";
}
else if (package.PartExists(new Uri("/ppt/presentation.xml", UriKind.Relative)))
{
ext = ".pptx";
}
package.Close();
return ext;
}
I haven't done any extensive testing, but it has worked for my current files.
I will leave the question open in case someone has a nicer solution.

In my experience, the Open XML SDK 2 is more useful for manipulating the XML internals of a document. If you just need the extension type, then why not just use:
string extension = System.IO.Path.GetExtension(filename);
It's worth noting that try/catch is an expensive approach for just determining external details, as the runtime has to collect all the exception details, stack trace, etc. for the catch block.
Also, Excel's extension is .xlsx; .xslt stands for "Extensible Stylesheet Language Transformations".
Hope that helps!
