Read text and format in a word document using Openxml

I've been trying to solve this myself, but it seems that I really need help.
I am reading a Word document using OpenXml, and I need both the text in the document and its formatting.
I have this code for getting the text and attributes:
WordprocessingDocument wordprocessingDocument = WordprocessingDocument.Open(stream, true);
Body body = wordprocessingDocument.MainDocumentPart.Document.Body;
foreach (var item in body)
{
    //Console.WriteLine(">>text: " + item.InnerText);
    foreach (var tt in item.GetAttributes())
    {
        Console.WriteLine(tt.LocalName + " : " + tt.Value);
    }
}
The output of the code above looks something like this:
rsidR : 0067182C
rsidP : 002A2C9A
rsidRDefault : 004052D2
rsidR : 0067182C
rsidRDefault : 004052D2
rsidR : 0067182C
rsidSect : 0067182C
What I need is the formatting used for each piece of text in the Word document. What do the attributes above mean?
This is a screenshot of my sample Word document. Can I retrieve the formatting as properties, such as bold, font name, and font size?

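Yes, you can get the formatting information for each piece of text. The rsid* attributes in your output are revision save IDs that Word uses to track editing sessions; they carry no formatting.
I am assuming that you have all the runs. Each run can have a RunProperties element, which holds the formatting applied directly to that run.
So iterate over each run and read the formatting like below.
bool border = run.RunProperties?.Border != null;
bool bold = run.RunProperties?.Bold != null;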

You can enumerate the nodes with Descendants and loop over them to get whatever you are looking for.
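As a rough illustration of the two answers above, the sketch below walks every run with Descendants<Run>() and describes its text together with bold, font name, and font size. It assumes the Open XML SDK (DocumentFormat.OpenXml) is referenced; RunFormattingDump, DescribeRun, and Dump are hypothetical names, not part of the original answers. It reports only the formatting stored directly on the run, so anything inherited from a style or from the document defaults shows up as "(inherited)".

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class RunFormattingDump
{
    // Hypothetical helper: summarizes the direct formatting of one run.
    public static string DescribeRun(Run run)
    {
        // Null when the run has no direct formatting at all.
        RunProperties props = run.RunProperties;

        // <w:b/> means bold; <w:b w:val="false"/> explicitly switches it off.
        Bold boldElement = props?.Bold;
        bool bold = boldElement != null
            && (boldElement.Val == null || boldElement.Val.Value);

        string font = props?.RunFonts?.Ascii?.Value ?? "(inherited)";

        // w:sz stores the size in half-points, so "22" means 11 pt.
        string halfPoints = props?.FontSize?.Val?.Value;
        string size = halfPoints != null && int.TryParse(halfPoints, out int hp)
            ? (hp / 2.0) + " pt"
            : "(inherited)";

        return $"\"{run.InnerText}\" bold={bold} font={font} size={size}";
    }

    // Opens the document read-only and prints one line per run.
    public static void Dump(string path)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;
            foreach (Run run in body.Descendants<Run>())
            {
                Console.WriteLine(DescribeRun(run));
            }
        }
    }
}
```

Calling RunFormattingDump.Dump with the path of a .docx file prints one line per run. To resolve inherited values you would additionally have to follow the run's style and the paragraph's style, which this sketch deliberately leaves out.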

Related

Extract bullets from word document using aspose.words in C#

I need to extract the text with the bullet style from a Word document in C#. I am using the Aspose.Words library, but a solution with a different library is also welcome. I can already upload documents and extract the text with Heading1 styling, but when I try the same with the bullet styling I get nothing.
I am using the code below to get the text with Heading1 styling, and that works.
var heading1 = doc
    .GetChildNodes(NodeType.Paragraph, true)
    .Cast<Aspose.Words.Paragraph>()
    .ToArray()
    .Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1);
foreach (var head1 in heading1)
{
    listBox11.Items.Add(head1.GetText().ToString());
}
I am trying to use the code below to get the text with bullet styling and this does NOT work.
var bullets = doc
    .GetChildNodes(NodeType.Paragraph, true)
    .Cast<Aspose.Words.Paragraph>()
    .ToArray()
    .Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.ListBullet);
foreach (var bullet in bullets)
{
    listBox19.Items.Add(bullet.GetText().ToString());
}
listBox19.Items.Add(bullet1.GetText().ToString());
I also tried using the listbullet1,2,3,4 and 5 styleIdentifiers but that also does not fix the problem.

Most likely your code does not work because the bullets are not applied via a style. In an MS Word document there are several levels at which formatting can be applied: document defaults, theme, style, and direct formatting. In your case, I think the best way is to use the ListFormat.IsListItem property.

I am now using this to successfully extract the list items from a Word file and put them into a listbox.
string fileName = listBox1.Items.Cast<string>().FirstOrDefault();
// Open the document.
Document doc = new Document(fileName);
doc.UpdateListLabels();
NodeCollection paras = doc.GetChildNodes(NodeType.Paragraph, true);
// Find if we have the paragraph list. In our document, our list uses plain Arabic numbers,
// which start at three and ends at six.
foreach (Aspose.Words.Paragraph paragraph in paras.OfType<Aspose.Words.Paragraph>().Where(p => p.ListFormat.IsListItem))
{
    //listBox19.Items.Add($"List item paragraph #{paras.IndexOf(paragraph)}");
    // This is the text we get when we output this node to text format.
    // This text output will omit list labels. Trim any paragraph formatting characters.
    string paragraphText = paragraph.ToString(SaveFormat.Text).Trim();
    //remove the dot in front of the bullet
    string bullet = paragraphText.Remove(0, 2);
    listBox19.Items.Add(bullet);
    ListLabel label = paragraph.ListLabel;
}

How to add text to existing paragraph without breaking the style in C#?

I have been trying to solve one problem in C# regarding updating paragraph text with some additional new text info:
I am not a C# developer, forgive me if the question is silly or easy to solve.
I have several paragraphs like this:
Alice is going to do some shopping.
Bob is a good guy.
Let's say, these paragraphs are written in Arial font with 11 pts. So I want to add some text after each paragraph.
The end result would be:
Alice is going to do some shopping.SomeText0
Bob is a good guy.SomeText1
I have tried this:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
{
    List<Paragraph> paragraphs = paragraphService.GetParagraphs(wordDoc);
    foreach (Paragraph par in paragraphs)
    {
        string paragraphText = paragraphService.ParagraphToText(par);
        paragraphText = textService.DeleteDoubleSpace(paragraphText);
        if (paragraphText.Length != 0)
        {
            if (paragraphText == targetParagraph)
            {
                //Here I know that the added text corresponds to my target paragraph.
                //This paragraph comes from a JSON file but for simplicity I did not add that part.
                par.Append(new Run(new Text("SomeText0")));
                par.ParagraphProperties.CloneNode(true);
            }
        }
    }
}
Adding the text works, but the new text does not get the same style; it ends up with some other style that I don't want. I want the newly added text to have the same font and size as the paragraph.
I have also tried several options, to make it Paragraph, just text, etc. But I could not find a solution.
Any help would be appreciated.

The Open XML format stores paragraphs like this:
<w:p>
  <w:r>
    <w:t>String from WriteToWordDoc method.</w:t>
  </w:r>
</w:p>
Here,
p is the element represented by the Paragraph class,
r is the element represented by Run class, and,
t is the element represented by the Text class.
So you are appending a new <w:r> => Run element which has its own format settings, and since you don't specify any formatting, defaults are used.
EDIT 1: And as it seems, when there are parts in this paragraph that are formatted differently, there can be multiple Run elements under a paragraph.
So, instead you can find the last Run element containing a Text element and modify its text.
foreach (Paragraph par in paragraphs)
{
    Run[] runs = par.OfType<Run>().ToArray();
    if (runs.Length == 0) continue;
    Run[] runsWithText = runs.Where(x => x.OfType<Text>().ToArray().Length > 0).ToArray();
    if (runsWithText.Length == 0) continue;
    Text lastText = runsWithText.Last().OfType<Text>().Last();
    lastText.Text += " Some Text 0";
}
Hope this helps.
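
If the appended text should live in its own run rather than be merged into the existing text, one option is to copy the formatting of the last run instead of relying on the defaults. The following is only a sketch under that assumption, using the Open XML SDK; ParagraphAppender and AppendWithSameFormat are hypothetical names. It clones the last run's RunProperties so that the new run carries the same direct formatting (font, size, bold, and so on).

```csharp
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class ParagraphAppender
{
    // Hypothetical helper: appends extraText in a new run that reuses
    // the direct formatting of the paragraph's last run.
    public static void AppendWithSameFormat(Paragraph paragraph, string extraText)
    {
        var newRun = new Run();
        Run lastRun = paragraph.Elements<Run>().LastOrDefault();

        // Deep-copy the run properties; the clone is detached, so it can be
        // attached to the new run without touching the original.
        if (lastRun?.RunProperties != null)
        {
            newRun.RunProperties = (RunProperties)lastRun.RunProperties.CloneNode(true);
        }

        // xml:space="preserve" keeps leading and trailing spaces in the text.
        newRun.AppendChild(new Text(extraText) { Space = SpaceProcessingModeValues.Preserve });

        if (lastRun != null)
        {
            lastRun.InsertAfterSelf(newRun);
        }
        else
        {
            paragraph.AppendChild(newRun);
        }
    }
}
```

Inside the loop from the question, the call would be ParagraphAppender.AppendWithSameFormat(par, "SomeText0"). Formatting that comes from a paragraph or character style is not copied by this sketch; it only covers properties set directly on the run.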

Passing a string gives a different outcome to passing a string variable

I tried finding an answer for this but could not.
I have this function that is supposed to create a formatted paragraph.
When I pass it an html string like "<b>Test</b>" I get the bold text in the pdf as expected.
However when I pass a string variable with the same value I don't get a formatted text but instead I just get the original string in the pdf.
private Paragraph CreateSimpleHtmlParagraph(string text)
{
    //Our return object
    Paragraph p = new Paragraph();
    //ParseToList requires a TextReader instead of just text
    using (StringReader sr = new StringReader(text))
    {
        //Parse and get a collection of elements
        List<IElement> elements = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(sr, null);
        foreach (IElement e in elements)
        {
            //Add those elements to the paragraph
            p.Add(e);
        }
    }
    //Return the paragraph
    return p;
}

Thanks so much guys. I checked the variable at runtime and it was HTML-encoded (e.g. &lt; instead of <). I had to use the HttpUtility.HtmlDecode function on the variable and that worked out perfectly.
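
A minimal illustration of that fix, assuming the variable held HTML-encoded markup. The answer used HttpUtility.HtmlDecode (System.Web); the sketch below swaps in WebUtility.HtmlDecode from System.Net only so that it runs without a System.Web reference. HtmlDecodeDemo and Decode are hypothetical names.

```csharp
using System;
using System.Net;

public static class HtmlDecodeDemo
{
    // Turns entity-encoded markup such as "&lt;b&gt;" back into "<b>"
    // so that an HTML parser sees real tags instead of literal text.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

With this helper, HtmlDecodeDemo.Decode("&lt;b&gt;Test&lt;/b&gt;") returns "<b>Test</b>", which is the form CreateSimpleHtmlParagraph needs in order to produce bold text.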

C# - Read plain text from XML data containing Word fields

I am developing a 'Search' feature for an application wherein I search for a keyword within XML content. I need to search only for the plain text i.e no xml tags or word fields. Below is a snippet of the code I use to read the text (excluding the XML tags and binary data):
StringBuilder result = new StringBuilder();
var reader = System.Xml.XmlReader.Create(new System.IO.StringReader(strXmlContent));
while (reader.Read())
{
    if (reader.Name == "pkg:binaryData" || reader.Name == "w:binData")
    {
        reader.Skip();
    }
    if (reader.NodeType == XmlNodeType.Text)
    {
        result.Append(reader.Value);
    }
}
//Plain text without XML tags.
string plainText = result.ToString();
if (plainText.ToLower().Contains(SearchText.ToLower()))
{
    // display search results
}
However, I found that since this xml actually stores Word document content, it also contains Word fields such as : ( REF _Ref325306498 \h * MERGEFORMAT Figure 1 and REF _Ref325306499 \h * MERGEFORMAT Figure 2)
Here the content that I want to search is "(Figure 1 and Figure 2)".
But I am unable to find this text as it also contains MERGEFORMAT and other Word fields.
How can I read only plain text from this xml data?

After parsing each XML DOM element containing a Word file, you could parse the word document into a string and then use that for your search - there are a couple of ways provided to get the word document contents as a string in this other SO thread - essentially, you could either save the document as text using Word automation or use a third party library or use the Word DOM from within your code.

You can try with XElement and XPath. You need to add System.Xml.Linq and System.Xml.XPath namespaces in your using directives.
var xml = XElement.Load("filepath");
string searchText = "your search text";
var matchElements = xml.XPathSelectElements(@"//*[contains(.,'" + searchText + "')]");
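
Another option is to stay with the XmlReader loop from the question and skip the field instructions themselves. In WordprocessingML, field codes such as REF _Ref325306498 \h \* MERGEFORMAT are stored in w:instrText elements, while the displayed result ("Figure 1") sits in ordinary w:t elements. The sketch below is a hedged variation of the question's code, assuming the XML uses the usual w: and pkg: prefixes; WordXmlText and ExtractVisibleText are hypothetical names. It also avoids calling Read() straight after Skip(), since Skip() already moves the reader to the next node.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class WordXmlText
{
    // Collects text nodes while skipping binary data and field instructions.
    public static string ExtractVisibleText(string xml)
    {
        var result = new StringBuilder();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            bool more = reader.Read();
            while (more)
            {
                bool skippable = reader.NodeType == XmlNodeType.Element
                    && (reader.Name == "w:instrText"
                        || reader.Name == "w:binData"
                        || reader.Name == "pkg:binaryData");

                if (skippable)
                {
                    // Skip() leaves the reader on the node after the end tag,
                    // so re-examine that node instead of reading past it.
                    reader.Skip();
                    more = !reader.EOF;
                    continue;
                }

                if (reader.NodeType == XmlNodeType.Text)
                {
                    result.Append(reader.Value);
                }

                more = reader.Read();
            }
        }
        return result.ToString();
    }
}
```

The returned string can then be searched case-insensitively, for example with IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0. Matching on the prefixed name is an assumption carried over from the question; documents that bind the namespace to a different prefix would need a check on LocalName and NamespaceURI instead.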

Rich Text to Plain Text via C#?

I have a program that reads through a Microsoft Word 2010 document and puts all text read from the first column of every table into a datatable. However, the resulting text also includes special formatting characters (that are usually invisible in the original Word document).
Is there a way that I can take the string of text that I've read and strip all the formatting characters from it?
The program is pretty simple, and uses the Microsoft.Office.Interop.Word assemblies. Here is the main loop where I'm grabbing the text from the document:
// Loop through each table in the document,
// grab only text from cells in the first column
// in each table.
foreach (Table tb in docs.Tables)
{
    for (int row = 1; row <= tb.Rows.Count; row++)
    {
        var cell = tb.Cell(row, 1);
        var listNumber = cell.Range.ListFormat.ListString;
        var text = listNumber + " " + cell.Range.Text;
        dt.Rows.Add(text);
    }
}
EDIT: The original post showed three screenshots here (not preserved): the text "1. Introduction" as it appears in the Word document, what it looks like before being put into my datatable, and what it looks like when put into the datatable.
So, I'm trying to figure out a simple way to get rid of the control characters that seem to be appearing (\r, \a, \n, etc).
EDIT: Here is the code I'm trying to use. I created a new method to convert the string:
private string ConvertToText(string rtf)
{
    using (RichTextBox rtb = new RichTextBox())
    {
        rtb.Rtf = rtf;
        return rtb.Text;
    }
}
When I run the program, it bombs with an error (the screenshots of the error message and of the value of the variable rtf at that point are not preserved).
RESOLUTION: I trimmed the unneeded characters before writing them to the datatable.
// Loop through each table in the document,
// grab only text from cells in the first column
// in each table.
foreach (Table tb in docs.Tables)
{
    for (int row = 1; row <= tb.Rows.Count; row++)
    {
        var charsToTrim = new[] { '\r', '\a', ' ' };
        var cell = tb.Cell(row, 1);
        var listNumber = cell.Range.ListFormat.ListString;
        var text = listNumber + " " + cell.Range.Text;
        text = text.TrimEnd(charsToTrim);
        dt.Rows.Add(text);
    }
}

I don't know exactly what formatting you're trying to remove, but you could try something like:
text = new string(text.Where(c => !Char.IsControl(c)).ToArray());
That should strip the non-printing characters out.

An alternative is to add a rich text box to your form (you can keep it hidden if you don't want to show it) and, when you have read all your data, assign it to the rich text box. Like this:
//rtfText is rich text
//rtBox is rich text box
rtBox.Rtf = rtfText;
//get simple text here.
string plainText = rtBox.Text;

Why don't you give this a try:
using System;
using System.Text.RegularExpressions;
public class Example
{
    static string CleanInput(string strIn)
    {
        // Replace invalid characters with empty strings.
        try {
            return Regex.Replace(strIn, @"[^\w\.@-]", "",
                                 RegexOptions.None, TimeSpan.FromSeconds(1.5));
        }
        // If we timeout when replacing invalid characters,
        // we should return Empty.
        catch (RegexMatchTimeoutException) {
            return String.Empty;
        }
    }
}
Here's a link for it as well.
http://msdn.microsoft.com/en-us/library/844skk0h.aspx

A totally different approach would be to look at the Open XML SDK.
This example should get you started.
