Extract words from a doc/docx file c#

Extract words from a doc/docx file c# - c#

I want to extract all the words from a Word file (doc/docx) and put them into a list. It seems like microsoft.Office.Interop works just if i want to extract paragraphs and add them into a list.
List<string> data = new List<string>();
Microsoft.Office.Interop.Word.Application app = new
Microsoft.Office.Interop.Word.Application();
Document doc = app.Documents.Open(dlg.FileName);
foreach (Paragraph objParagraph in doc.Paragraphs)
data.Add(objParagraph.Range.Text.Trim());
((_Document)doc).Close();
((_Application)app).Quit();`
I also found the way to extract word by word but it didn't works with big document because of the loop that generates an exception.
`Dictionary<int, string> motRap = new Dictionary<int, string>();
Microsoft.Office.Interop.Word.Application application = new Microsoft.Office.Interop.Word.Application();
Document document = application.Documents.Open("C:/Users/Titri/Desktop/test/test/bin/Debug/po.txt");
// Loop through all words in the document.
int count = document.Words.Count;
for (int i = 1; i <= count; i++)
{
string text = document.Words[i].Text;
motRap.Add(i, text);
}
// Close word.
application.Quit();`
So my question is, if there is a way to extract words from a big word file. I think that Microsoft.Office.Interop is not the good tool to extract from a big file.
Sorry my english is not good.

The object inside a paragraph is called Run, though I don't know whether or not this is available in Interop. To enhance your experience performancewise, I would suggest you switch to using OpenXmlSdk, in case you have to process a large amount of documents.
If you want to stick to Interop, why don't you just split each paragraph into an array (delimiter obviously space) and add all the words after that?

Related

Word found unreadable content in docm

I am writing a program in C# using Open XML that transfers data from excel to word.
Currently, I have this:
internal override void UpdateSectionSheets(int sectionNum, List<List<string>> tableContents)
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(MainForm.WordFileDialog.FileName, true))
{
List<Table> tables = doc.MainDocumentPart.Document.Descendants<Table>().ToList();
foreach(Table table in tables)
{
int row = 1;
if (table.Descendants<TableRow>().FirstOrDefault().Descendants<TableCell>().FirstOrDefault().InnerText == sectionNum.ToString())
{
foreach(var item in tableContents[0])
{
// splits the tableContents[0][row - 1] into individual strings at each instance of "\n\n"
String str = tableContents[0][row - 1];
String[] separator = {"\n\n"};
Int32 count = 6; // max 6 sub strings (usually only two but allowed for extra)
String[] subStrs = str.Split(separator, count, StringSplitOptions.RemoveEmptyEntries);
// transfer comment
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).RemoveAllChildren<Paragraph>(); // removes the existing contents in the cell
foreach (string s in subStrs)
{
// for every substring, create a new paragraph and append the substring to that new paragraph. Makes it so that each sentence is on its own line
Text text = new Text(s);
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).AppendChild(new Paragraph(new Run(text)));
}
// transfer verdict
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(3).RemoveAllChildren<Paragraph>();
Paragraph p = new Paragraph(new ParagraphProperties(new Justification() { Val = JustificationValues.Center }));
p.Append(new Run(new Text(tableContents[1][row - 1])));
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(3).AppendChild(p);
row++;
}
}
}
doc.Save();
}
}
I believe the line causing the issue is: table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).AppendChild(new Paragraph(new Run(text)));
If I put new Text(tableContents[0][row - 1]) in place of (text) in the above line, the program will run and word doc will open with no errors, but the output is not in the format I need.
The program runs without throwing any errors, but when I try to open the word doc it gives a "word found unreadable content in xxx.docm" error. If I say I trust the source and want word to recover the document, I can open the doc and see that the code is working how I want. However, I don't want to have to do that every time. Does anyone know what is causing the error and how I can fix it?

Extract bullets from word document using aspose.words in C#

I need to extract the text with the bullet style from a word document in C#. I am using the aspose.words library but a solution with a different library is also welcome. I can already upload documents and extract the text with heading1 styling. but when I try the same with the bullet styling I get nothing.
I am using the code below to get the text with Heading1 styling and that works.
var heading1 = doc
.GetChildNodes(NodeType.Paragraph, true)
.Cast<Aspose.Words.Paragraph>()
.ToArray()
.Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1);
foreach (var head1 in heading1)
{
listBox11.Items.Add(head1.gettext()tostring());
}
I am trying to use the code below to get the text with bullet styling and this does NOT work.
var bullets = doc
.GetChildNodes(NodeType.Paragraph, true)
.Cast<Aspose.Words.Paragraph>()
.ToArray()
.Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.ListBullet);
foreach (var bullet in bullets)
{
listBox19.Items.Add(bullet.GetText().ToString());
}
listBox19.Items.Add(bullet1.GetText().ToString());
I also tried using the listbullet1,2,3,4 and 5 styleIdentifiers but that also does not fix the problem.

Most likely your code does not work because bullets are not applied via style. In MS Word document there are several levels where you can apply formatting: Document defaults, Theme, Style and direct formatting. In your case, I think, the best way is to use ListFormat.IsListItem property.

I am now using this to succesfully extract the list items from a word file and put them into a listbox.
string fileName = listBox1.Items.Cast<string>().FirstOrDefault();
// Open the document.
Document doc = new Document(fileName);
doc.UpdateListLabels();
NodeCollection paras = doc.GetChildNodes(NodeType.Paragraph, true);
// Find if we have the paragraph list. In our document, our list uses plain Arabic numbers,
// which start at three and ends at six.
foreach (Aspose.Words.Paragraph paragraph in paras.OfType<Aspose.Words.Paragraph>().Where(p => p.ListFormat.IsListItem))
{
//listBox19.Items.Add($"List item paragraph #{paras.IndexOf(paragraph)}");
// This is the text we get when getting when we output this node to text format.
// This text output will omit list labels. Trim any paragraph formatting characters.
string paragraphText = paragraph.ToString(SaveFormat.Text).Trim();
//remove the dot in front of the bullet
string bullet = paragraphText.Remove(0, 2);
listBox19.Items.Add(bullet);
ListLabel label = paragraph.ListLabel;
}

Search Text and highlight it

My code is in C#
I am using Aspose to search text and highlight it in pdf.
It is working but the time taken is very huge.
Example : My document has 25 pages and it has 25 instance of search text , 1 search text in each page.
It take 2 minutes which is unacceptable.
I have 3 questions:
Is it a way to reduce this time taken ?
Currently this approach is for pdf, in my case i have all types of doc (xls, pdf, ppt, doc)? Is there any way where this search and highlighting can be performed in all docs ?
Is there some better way of doing it other than aspose ?
// open document
Document document = new Document(#"C:\TestArea\Destination\SUP000011\ATM-1B4L2KQ0ZE0-0001\OpenAML.pdf");
//create TextAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Martin");
//accept the absorber for all the pages
for (int i = 1; i <= document.Pages.Count; i++)
{
document.Pages[i].Accept(textFragmentAbsorber);
//get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
//loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
//update text and other properties
// textFragment.TextState.Invisible = false;
//textFragment.Text = "TEXT";
textFragment.TextState.Font = FontRepository.FindFont("Verdana");
textFragment.TextState.FontSize = 9;
textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Blue);
textFragment.TextState.BackgroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Yellow);
//textFragment.TextState.Underline = true;
}
}
// Save resulting PDF document.
document.Save(#"C:\TestArea\Destination\SUP000011\ATM-1B4L2KQ0ZE0-0001\Highlightdoc.pdf");

Insert List<string> into word document

I'm currently trying to find a way to read in, and insert data into a word document. So far this is what I have gotten:
class Program
{
static void Main(string[] args)
{
var FileName = #"C:\temp\test.DOC";
List<string> data = new List<string>();
Application app = new Application();
Document doc = app.Documents.Open(#"C:\temp\test.DOC");
foreach (Paragraph objParagraph in doc.Paragraphs)
{
data.Add(objParagraph.Range.Text.Trim());
}
//data.Insert
data.Insert(16, "Test 1");
data.Insert(16, "\tTest 2\tName\tAmount");
data.Insert(16, "Test 3");
data.Insert(16, "Test 4");
data.Insert(16, "Test 5");
data.Insert(16, "Test 6");
data.Insert(16, "Test 7");
data.Insert(16, "Test 8");
data.Insert(16, "Test 9");
data.Insert(16, "Test 10");
var x = doc.Paragraphs.Add();
x.Range.Text.Insert(0,"\tTest 2\tName\tAmount");
doc.SaveAs2(#"C:\temp\test3.DOC");
((_Document)doc).Close();
((_Application)app).Quit();
}
}
Now, this successfully populates the List data - but I'm trying to append each new test element at the [16]th index, and save it into the word document. Is there a simple way to accomplish this, or am I just over-thinking this issue?
I realize the string list is separate from the Document object which represents the word document.
I have a few other places in the document where I am using bookmarks to add data, but I don't think it is possible to use bookmarks for placing the data in this instance - or If I don't have to use bookmarks I'd like to stray away from that.
EDIT: I am trying to insert X amount of elements at the [16]th position within the data[].
EDIT 2:
Essentially I am sourcing the data dynamically, and I'm not sure how many records/rows I'll need to add to the document, so it could be as follows:
[15]
[16]\tName\tID\tAMOUNT
[17]\tName\tID\tAMOUNT
[18]\tName\tID\tAMOUNT
Since the headers will already be there (NAME,ID,AMOUNT), and each time I run the program I'm not sure how many elements I'll be inserting into the document - so as long as each element is placed under one another, and on the 16th line in the document template I have setup that should accomplish what I am trying to do.
Image 1 - Image into string array
Image 2 - Image after adding content into the string - this is what the resulting document. (this is to be saved)
I'm attempting to put each element ie: Test1 Test2 Test3 in their each own column each (see above)

Again I am totally confused as to why you want to read the word file into a string list array. This simply adds the text you show after line 15 into the word document. You do not specify WHERE Test 1, Test 2, Test3... are coming from.
Edit: Added a try-catch just in case the document does not have at least 16 paragraphs.
static void Main(string[] args)
{
List<string> data = new List<string>();
Application app = new Application();
Document doc = app.Documents.Open(#"C:\temp\test.DOC");
string testRows = "Test 1\n\tTest 2\tName\tAmount\nTest 3\nTest 4\nTest 5\nTest 6\nTest 7\nTest 8\nTest 9\nTest 10\n";
try
{
var x = doc.Paragraphs[16];
x = doc.Paragraphs.Add(x.Range);
x.Range.Text = testRows;
doc.SaveAs2(#"C:\temp\test3.DOC");
}
catch (System.Runtime.InteropServices.COMException e)
{
Console.WriteLine("COMException: " + e.StackTrace.ToString());
Console.ReadKey();
}
((_Document)doc).Close();
((_Application)app).Quit();
}

So what I figured out (for my purposes) is that is is easiest to insert a list of strings into makeshift columns separated by tabs by inserting at specific paragraphs.
Since I am using bookmarks to place text as well - I found it useful to work from a copy of a document instead of worrying about removing/creating bookmarks each time.
When populating the list that you are going to be placing at a specific paragraph mark it is useful to append tab characters as well as newline charters on the fly. Later on this will make it easier to loop through the list and place them nicely on the document.
Depending on the way you are going to go about placing columns some logic will have to be determined to space everything correctly. I did this by creating maximum lengths for columns and trimming, and accommodating for smaller/larger lengths by adding specific amounts of tab characters.
So, my columns I am populating would look like:
myList.Add("\t12345678912345\tJohn Doe\t\t\t\t123456\r\n");
myList.Add("\987654321654987\tJohn Smith\t\t\t\98765\r\n");
These lines would be inserted at paragraph 17 and placed neatly under headers.
Lastly, I decided to use bookmarks to place single lines of text like the date,title, and signature values since those values don't need to be correctly spaced or anything.
At the end I delete the copy of the word document I'm working on, and delete the pdf (since in my case I'm sending it via email)
Thank you for the help #JohnG - I hope this answer might help others who come across it. I removed the try-catch since I'm working from the template as well.
File.Copy(sCurrentPath + "\\" + "testTemplate.DOC", sCurrentPath + "\\" + "test.DOC");
Application app = new Application();
Document doc = app.Documents.Add(sCurrentPath + "\\" + "test.DOC");
foreach (string sValue in myList)
{
var List = doc.Paragraphs[17];
myList = doc.Paragraphs.Add(myList.Range);
myList.Range.Text = sValue;
}
if (doc.Bookmarks.Exists("Date"))
{
object oBookMark = "Date";
doc.Bookmarks.get_Item(ref oBookMark).Range.Text = DateTime.Now.ToString("MM/dd/yyyy");
}
if (doc.Bookmarks.Exists("Signature"))
{
object oBookMark = "Signature";
doc.Bookmarks.get_Item(ref oBookMark).Range.Text = "My Name";
}
if (doc.Bookmarks.Exists("Title"))
{
object oBookMark = "Title";
doc.Bookmarks.get_Item(ref oBookMark).Range.Text = "Title Here";
}
doc.ExportAsFixedFormat(sCurrentPath + "\\" + "test.pdf", WdExportFormat.wdExportFormatPDF);
File.Delete(sCurrentPath + "\\" + "testCopy.DOC");
File.Delete(sCurrentPath + "\\" + "test.pdf");
((_Document)doc).Close();
((_Application)app).Quit();

Fast way to get Word range as xml in C#?

my code:
// use MS-Word Interop
var rangeList = doc.Sentences.ofType<Range>().ToList();//About 4000 Sentences
foreach (var range in rangeList)
{
string sentXml=range.get_XML(false);//get all sentences xml,it's very slow ,about 18 min
//ConvertToFlowDocument(sentxml);
}
but it is very slow.
how do i convert range.WordOpenXML to openxml elements or etc...
yes,i also use range.Copy() to Convert to rtf string,it is also slow.

Sure, you can use Range.XML.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.office.interop.word.range.xml?view=word-pia
string sentXml = string.Empty;
//Use MS-W Word Interop
foreach (var range in doc.Sentences)
//You don't need to create a list variable because Sentences implements IEnumerable
string sentXml += range.XML; //Get XML for all sentences
//ConvertToFlowDocument(sentxml);
}
I'm not sure how you are going about it , but it would be helpful if you showed what the get_Xml method does. Based on your comments, it sounds like my method may be faster. You can open up a StreamReader, and then insert the XML inside.
WordprocessingDocument MainWordDocument;
using (StreamReader sr = new StreamReader(m_MainWordDocument.MainDocumentPart.GetStream()))
{
string docText = sr.ReadToEnd();
//Then, you can replace any text that you want
//string textToReplace("");
//docText = docText.Replace("what you want to replace", "some xml string that you extracted");
//Or you can just add it onto the end.
string docText = docText + sentXml;
//Open a StreamWriter and write overwrite the file
using (StreamWriter writer = new StreamWriter(filePath, true))
{
writer.Write(docText);
}
}
I don't think this is the most efficient way of doing things and don't know why someone would you use both WordInterop and OpenXML together, and this question is quite old, but this answers your question. If you think that there is a good use case for using the two together, I'd be interested to hear about it.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Extract words from a doc/docx file c# - c#

Related

Word found unreadable content in docm

Extract bullets from word document using aspose.words in C#

Search Text and highlight it

Insert List<string> into word document

Fast way to get Word range as xml in C#?

Categories

Resources