Extract bullets from word document using aspose.words in C#


I need to extract the text that has the bullet style from a Word document in C#. I am using the Aspose.Words library, but a solution with a different library is also welcome. I can already upload documents and extract the text with Heading 1 styling, but when I try the same with the bullet styling I get nothing.
I am using the code below to get the text with Heading1 styling and that works.
var heading1 = doc
.GetChildNodes(NodeType.Paragraph, true)
.Cast<Aspose.Words.Paragraph>()
.ToArray()
.Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1);
foreach (var head1 in heading1)
{
listBox11.Items.Add(head1.GetText().ToString());
}
I am trying to use the code below to get the text with bullet styling and this does NOT work.
var bullets = doc
.GetChildNodes(NodeType.Paragraph, true)
.Cast<Aspose.Words.Paragraph>()
.ToArray()
.Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.ListBullet);
foreach (var bullet in bullets)
{
listBox19.Items.Add(bullet.GetText().ToString());
}
I also tried the ListBullet2, ListBullet3, ListBullet4 and ListBullet5 style identifiers, but that does not fix the problem either.

Most likely your code does not work because the bullets are not applied via a style. In an MS Word document, formatting can be applied at several levels: document defaults, theme, style and direct formatting. In your case, I think the best way is to use the ListFormat.IsListItem property.

I am now using this to successfully extract the list items from a Word file and put them into a list box.
string fileName = listBox1.Items.Cast<string>().FirstOrDefault();
// Open the document.
Document doc = new Document(fileName);
// Make sure the list labels are up to date.
doc.UpdateListLabels();
NodeCollection paras = doc.GetChildNodes(NodeType.Paragraph, true);
// Keep only the paragraphs that are list items (bulleted or numbered).
foreach (Aspose.Words.Paragraph paragraph in paras.OfType<Aspose.Words.Paragraph>().Where(p => p.ListFormat.IsListItem))
{
// Export the paragraph as plain text and trim any paragraph formatting characters.
string paragraphText = paragraph.ToString(SaveFormat.Text).Trim();
// In my output the text starts with the bullet character and a space, so drop those two characters.
string bullet = paragraphText.Length >= 2 ? paragraphText.Remove(0, 2) : paragraphText;
listBox19.Items.Add(bullet);
}
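Since IsListItem is also true for numbered list items, a variant that keeps only bulleted paragraphs may be useful. The following is a minimal sketch, not a tested solution: the helper name GetBulletTexts is made up for illustration, and it assumes the list level's NumberStyle is NumberStyle.Bullet for bulleted lists and that GetText() returns the paragraph text without the list label.

```csharp
using System.Collections.Generic;
using System.Linq;
using Aspose.Words;

public static class BulletExtractor
{
    // Hypothetical helper: returns the text of every paragraph that is a
    // list item whose list level uses the bullet number style.
    public static List<string> GetBulletTexts(Document doc)
    {
        return doc.GetChildNodes(NodeType.Paragraph, true)
            .OfType<Paragraph>()
            .Where(p => p.ListFormat.IsListItem
                        && p.ListFormat.ListLevel.NumberStyle == NumberStyle.Bullet)
            // Trim() drops the trailing paragraph mark.
            .Select(p => p.GetText().Trim())
            .ToList();
    }
}
```

Each returned string can then be added to the list box, for example with foreach (string s in BulletExtractor.GetBulletTexts(doc)) listBox19.Items.Add(s);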

Related

How to add text to existing paragraph without breaking the style in C#?

I have been trying to solve a problem in C#: appending some new text to the text of an existing paragraph.
I am not a C# developer, so forgive me if the question is silly or easy to solve.
I have several paragraphs like this:
Alice is going to do some shopping.
Bob is a good guy.
Let's say these paragraphs are written in Arial at 11 pt. I want to add some text after each paragraph.
The end result would be:
Alice is going to do some shopping.SomeText0
Bob is a good guy.SomeText1
I have tried this:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
{
List<Paragraph> paragraphs = paragraphService.GetParagraphs(wordDoc);
foreach (Paragraph par in paragraphs)
{
string paragraphText = paragraphService.ParagraphToText(par);
paragraphText = textService.DeleteDoubleSpace(paragraphText);
if (paragraphText.Length != 0)
{
if (paragraphText == targetParagraph)
{
// Here I know that the added text corresponds to my target paragraph.
// The target paragraph comes from a JSON file, but for simplicity I left that part out.
par.Append(new Run(new Text("SomeText0")));
par.ParagraphProperties.CloneNode(true);
}
}
}
}
Adding the text works, but the new text does not get the same style; it gets some other style that I don't want. I want the newly added text to have the same font and size as the rest of the paragraph.
I have also tried several other options (adding it as a Paragraph, as plain text, etc.), but I could not find a solution.
Any help would be appreciated.

The Open XML format stores paragraphs like this:
<w:p>
<w:r>
<w:t>String from WriteToWordDoc method.</w:t>
</w:r>
</w:p>
Here,
p is the element represented by the Paragraph class,
r is the element represented by the Run class, and
t is the element represented by the Text class.
So you are appending a new <w:r> (Run) element, which has its own format settings, and since you don't specify any formatting, the defaults are used.
EDIT 1: When parts of a paragraph are formatted differently, there can be multiple Run elements under that paragraph.
So, instead, you can find the last Run element that contains a Text element and modify its text.
foreach (Paragraph par in paragraphs)
{
Run[] runs = par.OfType<Run>().ToArray();
if (runs.Length == 0) continue;
Run[] runsWithText = runs.Where(x => x.OfType<Text>().ToArray().Length > 0).ToArray();
if (runsWithText.Length == 0) continue;
Text lastText = runsWithText.Last().OfType<Text>().Last();
lastText.Text += " Some Text 0";
}
Hope this helps.
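If the new text has to live in its own run, another option is to copy the formatting of the last existing run onto the new one. This is a minimal sketch under assumptions: the helper name AppendWithLastRunFormatting is hypothetical, and it only copies direct run formatting (RunProperties); formatting that comes from the paragraph or character style is inherited anyway.

```csharp
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class ParagraphAppender
{
    // Hypothetical helper: appends `text` in a new run that carries a deep
    // copy of the last run's direct formatting, if there is any.
    public static void AppendWithLastRunFormatting(Paragraph paragraph, string text)
    {
        Run lastRun = paragraph.Elements<Run>().LastOrDefault();
        var newRun = new Run();

        if (lastRun?.RunProperties != null)
        {
            // CloneNode(true) returns a detached copy, so it can be attached
            // to the new run without removing it from the old one.
            newRun.RunProperties = (RunProperties)lastRun.RunProperties.CloneNode(true);
        }

        // Preserve leading and trailing spaces in the appended text.
        newRun.AppendChild(new Text(text) { Space = SpaceProcessingModeValues.Preserve });
        paragraph.AppendChild(newRun);
    }
}
```

In the loop from the question this would replace the par.Append(...) call, for example ParagraphAppender.AppendWithLastRunFormatting(par, "SomeText0");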

Read text and format in a word document using Openxml

I've been trying to solve this myself, but it seems that I really need help.
I am reading a Word document using OpenXml.
I need the text in the Word document and its formatting.
I have this code for getting the text and attributes:
WordprocessingDocument wordprocessingDocument = WordprocessingDocument.Open(stream, true);
Body body = wordprocessingDocument.MainDocumentPart.Document.Body;
foreach (var item in body)
{
//Console.WriteLine(">>text: " + item.InnerText);
foreach (var tt in item.GetAttributes())
{
Console.WriteLine(tt.LocalName + " : " + tt.Value);
}
}
The output of the code above looks something like this:
rsidR : 0067182C
rsidP : 002A2C9A
rsidRDefault : 004052D2
rsidR : 0067182C
rsidRDefault : 004052D2
rsidR : 0067182C
rsidSect : 0067182C
What I need is the formatting used for each piece of text in the Word document. What do these attributes mean?
My sample Word document is shown in the screenshot below. Can I retrieve the formatting as properties such as bold, font name and font size?
[Screenshot of the sample Word document]

Yes, you can get the formatting information for each piece of text.
I am assuming that you have all the runs. Each run has a RunProperties element, which holds its formatting information.
So iterate over the runs and read the formatting like this:
// RunProperties is null when the run has no direct formatting.
bool hasBorder = run.RunProperties?.Border != null;
bool isBold = run.RunProperties?.Bold != null;

You can also get the runs with Descendants, loop over them, and read whatever formatting you are looking for.
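Putting the two answers together, a loop over all runs could look like the sketch below. It is only an illustration: the class and method names are made up, it reads direct run formatting only (values inherited from styles show up as missing), and it assumes the font of interest is the Ascii font of the run.

```csharp
using System;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class RunFormatDumper
{
    // Hypothetical helper: prints text, bold flag, font name and font size
    // for every run in the main document body.
    public static void Dump(string path)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;
            foreach (Run run in body.Descendants<Run>())
            {
                RunProperties props = run.RunProperties;

                // <w:b/> without a val attribute means "on"; val="0" means "off".
                Bold b = props?.Bold;
                bool bold = b != null && (b.Val == null || b.Val.Value);

                string font = props?.RunFonts?.Ascii?.Value;
                // FontSize is stored in half-points, e.g. "22" for 11 pt.
                string halfPoints = props?.FontSize?.Val?.Value;

                Console.WriteLine(
                    $"{run.InnerText} | bold={bold} | font={font ?? "(inherited)"} | size={halfPoints ?? "(inherited)"} half-points");
            }
        }
    }
}
```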

Extract words from a doc/docx file c#

I want to extract all the words from a Word file (doc/docx) and put them into a list. It seems that Microsoft.Office.Interop only works well if I want to extract paragraphs and add them to a list.
List<string> data = new List<string>();
Microsoft.Office.Interop.Word.Application app = new
Microsoft.Office.Interop.Word.Application();
Document doc = app.Documents.Open(dlg.FileName);
foreach (Paragraph objParagraph in doc.Paragraphs)
data.Add(objParagraph.Range.Text.Trim());
((_Document)doc).Close();
((_Application)app).Quit();
I also found a way to extract the text word by word, but it does not work with a big document because the loop throws an exception.
Dictionary<int, string> motRap = new Dictionary<int, string>();
Microsoft.Office.Interop.Word.Application application = new Microsoft.Office.Interop.Word.Application();
Document document = application.Documents.Open("C:/Users/Titri/Desktop/test/test/bin/Debug/po.txt");
// Loop through all words in the document.
int count = document.Words.Count;
for (int i = 1; i <= count; i++)
{
string text = document.Words[i].Text;
motRap.Add(i, text);
}
// Close word.
application.Quit();
So my question is whether there is a way to extract the words from a big Word file. I think that Microsoft.Office.Interop is not a good tool for extracting from a big file.
Sorry, my English is not good.

The object inside a paragraph is called a Run, though I don't know whether it is available in Interop. To improve performance, I would suggest switching to the Open XML SDK if you have to process a large number of documents.
If you want to stick to Interop, why don't you just split the text of each paragraph into an array (using a space as the delimiter) and add all the words to your list?
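The splitting idea can be sketched as follows. This is a rough illustration, not a full tokenizer: the class and method names are hypothetical, the input is assumed to be the paragraph strings already collected (for example the data list from the question), and punctuation stays attached to the words.

```csharp
using System;
using System.Collections.Generic;

public static class WordSplitter
{
    // Whitespace and control characters that can end up in paragraph text.
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n', '\a', '\v', '\f' };

    // Hypothetical helper: splits each paragraph on whitespace and returns
    // all words in document order.
    public static List<string> SplitIntoWords(IEnumerable<string> paragraphs)
    {
        var words = new List<string>();
        foreach (string paragraph in paragraphs)
        {
            // RemoveEmptyEntries drops the empty strings produced by repeated separators.
            words.AddRange(paragraph.Split(Separators, StringSplitOptions.RemoveEmptyEntries));
        }
        return words;
    }
}
```

This avoids indexing document.Words one item at a time, which is the loop that fails on big documents in the question.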

Cannot change hyperlink style with Word interop without changing the style of the next paragraph

I have a document with a format similar to
Section Heading 1
Paragraph 1
...
Paragraph N
Sub Heading 1
Paragraph 1
...
Paragraph N
What I am trying to do is add a hyperlink from a heading to a reference document. I can add the hyperlink and apply a style to the link, but the style gets applied to the section's Paragraph 1 as well as to the hyperlink.
Note: WordApp is a singleton wrapper around Microsoft.Office.Interop.Word.Application. The HyperlinkDestination class just holds the bookmark name and the path of the file that contains the bookmark.
private void LinkHeadings(string file)
{
Document proposal = WordApp.Open(file);
for (int i = 1; i <= proposal.Paragraphs.Count; i++)
{
HyperlinkDestination dest = null;
Paragraph paragraph = proposal.Paragraphs[i];
paragraph.Range.Select();
Style style = (Style)paragraph.get_Style();
string styleString = ((Style)paragraph.get_Style()).NameLocal;
string headingText = paragraph.Range.Text.Split(' ')[0];
if (styleString.Contains("Heading"))
{
dest = _hyperlinkDestinations.Find(x => x.HyperlinkText == headingText);
}
if (dest != null)
{
Hyperlink link = WordApp.ActiveWindow.Document.Hyperlinks.Add(WordApp.Selection.Range, Address: dest.FilePath, SubAddress: dest.bookmarkName, TextToDisplay: WordApp.Selection.Text);
link.Range.set_Style(style);
}
}
WordApp.Close(true);
}
My guess is that it has something to do with the hyperlink anchor. I've also tried deleting the heading first and then inserting the hyperlink, but that has the same result.

The basic problem is that you are including the paragraph mark in the Hyperlink field that Word inserts. That paragraph mark is then hidden when the hyperlink field result is displayed, i.e. the Section Heading 1 paragraph actually becomes part of Paragraph 1. When you apply the style to the selection, the entire paragraph is affected.
I'm not going to attempt to provide complete C# here, but here are some suggestions:
a. As a rule, it is better to work with Range objects in Word than with the Selection where possible, and you should be able to do so here.
b. If you apply the hyperlink to the paragraph without the paragraph mark, the paragraph style will be unchanged, so you should not need to re-apply it.
c. So instead of the code starting with "paragraph.Range.Select();" you should be able to use something like this (I leave it to you to get the C# syntax right; perhaps you can edit this message):
Range r = paragraph.Range;
string styleString = ((Style)paragraph.get_Style()).NameLocal;
string headingText = r.Text.Split(' ')[0];
// You should probably also test for an empty paragraph here before inserting anything.
if (styleString.Contains("Heading"))
{
dest = _hyperlinkDestinations.Find(x => x.HyperlinkText == headingText);
}
if (dest != null)
{
// Move the end of the range one character towards the beginning so that it excludes the paragraph mark.
r.MoveEnd(WdUnits.wdCharacter, -1);
Hyperlink link = WordApp.ActiveWindow.Document.Hyperlinks.Add(r, Address: dest.FilePath, SubAddress: dest.bookmarkName, TextToDisplay: r.Text);
}
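If your code needs to run internationally and you only need to check paragraphs with the built-in styles Heading 1 to Heading 9, it would be better not to rely on the localized style name. As far as I know, Style.Type only reports the kind of style (paragraph, character, table or list), so compare the paragraph's style with the built-in heading styles instead, for example the ones returned by Styles[WdBuiltinStyle.wdStyleHeading1] to Styles[WdBuiltinStyle.wdStyleHeading9]. If you have other styles called "Heading something" that need to be included, then you probably need to check the name as well.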

C# - Read plain text from XML data containing Word fields

I am developing a 'Search' feature for an application in which I search for a keyword within XML content. I need to search only the plain text, i.e. no XML tags or Word fields. Below is a snippet of the code I use to read the text (excluding the XML tags and binary data):
StringBuilder result = new StringBuilder();
var reader = System.Xml.XmlReader.Create(new System.IO.StringReader(strXmlContent));
while (reader.Read())
{
if (reader.Name == "pkg:binaryData" || reader.Name == "w:binData")
{
reader.Skip();
}
if (reader.NodeType == XmlNodeType.Text)
{
result.Append(reader.Value);
}
}
//Plain text without XML tags.
string plainText = result.ToString();
if (plainText.ToLower().Contains(SearchText.ToLower()))
{
// display search results
}
However, since this XML actually stores Word document content, it also contains Word fields, such as: ( REF _Ref325306498 \h \* MERGEFORMAT Figure 1 and REF _Ref325306499 \h \* MERGEFORMAT Figure 2)
Here the content that I want to search for is "(Figure 1 and Figure 2)".
But I am unable to find this text, because the extracted text also contains MERGEFORMAT and other field codes.
How can I read only the plain text from this XML data?

After parsing each XML DOM element containing a Word file, you could convert the Word document into a string and then use that for your search. There are a couple of ways to get the Word document contents as a string in this other SO thread. Essentially, you could save the document as text using Word automation, use a third-party library, or use the Word DOM from within your code.

You can try XElement with XPath. You need to add the System.Xml.Linq and System.Xml.XPath namespaces to your using directives.
var xml = XElement.Load("filepath");
string searchText = "your search text";
var matchElements = xml.XPathSelectElements(@"//*[contains(.,'" + searchText + "')]");
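Neither approach above skips the field codes themselves. In WordprocessingML the instructions of complex fields such as REF are stored in w:instrText elements, while the visible text is stored in w:t elements, so one option is to collect only the w:t content. The sketch below assumes that structure; the class and method names are hypothetical, elements are matched by local name so that the exact namespace URI does not matter, and paragraph breaks are not preserved.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class WordXmlText
{
    // Hypothetical helper: concatenates the text of all <w:t> elements that
    // sit inside a run. Field instructions live in <w:instrText> elements
    // and are therefore left out.
    public static string ExtractVisibleText(string xmlContent)
    {
        XDocument doc = XDocument.Parse(xmlContent);
        var pieces = doc.Descendants()
            .Where(e => e.Name.LocalName == "t"
                        && e.Parent != null
                        && e.Parent.Name.LocalName == "r")
            .Select(e => e.Value);
        return string.Concat(pieces);
    }
}
```

The search from the question could then be written as WordXmlText.ExtractVisibleText(strXmlContent).IndexOf(SearchText, StringComparison.OrdinalIgnoreCase) >= 0.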
