C# - Read plain text from XML data containing Word fields

I am developing a 'Search' feature for an application in which I search for a keyword within XML content. I need to search only the plain text, i.e. no XML tags or Word fields. Below is a snippet of the code I use to read the text (excluding the XML tags and binary data):
StringBuilder result = new StringBuilder();
var reader = System.Xml.XmlReader.Create(new System.IO.StringReader(strXmlContent));
while (reader.Read())
{
    if (reader.Name == "pkg:binaryData" || reader.Name == "w:binData")
    {
        reader.Skip();
    }
    if (reader.NodeType == XmlNodeType.Text)
    {
        result.Append(reader.Value);
    }
}
// Plain text without XML tags.
string plainText = result.ToString();
if (plainText.ToLower().Contains(SearchText.ToLower()))
{
    // display search results
}
However, I found that since this XML actually stores Word document content, it also contains Word fields such as: ( REF _Ref325306498 \h * MERGEFORMAT Figure 1 and REF _Ref325306499 \h * MERGEFORMAT Figure 2)
Here the content that I want to search is "(Figure 1 and Figure 2)".
But I am unable to find this text because the extracted string also contains MERGEFORMAT and other Word field codes.
How can I read only the plain text from this XML data?

After parsing each XML DOM element containing a Word file, you could convert the Word document to a string and search that instead. This other SO thread describes a few ways to get a Word document's contents as a string: save the document as text using Word automation, use a third-party library, or use the Word DOM from within your code.

You can try XElement with XPath. You need to add the System.Xml.Linq and System.Xml.XPath namespaces to your using directives.
var xml = XElement.Load("filepath");
string searchText = "your search text";
var matchElements = xml.XPathSelectElements(@"//*[contains(.,'" + searchText + "')]");
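Another option, staying close to the reader loop in the question, is to skip the elements that carry the field instructions. The sketch below assumes the field codes (REF ... MERGEFORMAT) are stored in w:instrText elements, which is where WordprocessingML keeps them for complex fields, and that the document uses the w: and pkg: prefixes shown in the question. ExtractPlainText is a hypothetical helper name. Check your own XML to confirm where the field codes live before relying on it.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class WordXmlText
{
    // Hypothetical helper: collects text nodes, skipping binary data and
    // field instructions (assumed to live in w:instrText elements).
    public static string ExtractPlainText(string xml)
    {
        var result = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.Read();
            while (!reader.EOF)
            {
                bool skip = reader.NodeType == XmlNodeType.Element &&
                            (reader.Name == "pkg:binaryData" ||
                             reader.Name == "w:binData" ||
                             reader.Name == "w:instrText");
                if (skip)
                {
                    // Skip() already moves to the node after the end tag,
                    // so do not call Read() again here.
                    reader.Skip();
                    continue;
                }
                if (reader.NodeType == XmlNodeType.Text)
                {
                    result.Append(reader.Value);
                }
                reader.Read();
            }
        }
        return result.ToString();
    }
}
```

Simple fields (w:fldSimple) keep their instruction in an attribute rather than in a text node, so they should not leak into the extracted text in the first place.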

Related

Extract bullets from word document using aspose.words in C#

I need to extract the text with the bullet style from a Word document in C#. I am using the Aspose.Words library, but a solution with a different library is also welcome. I can already upload documents and extract the text with Heading1 styling, but when I try the same with the bullet styling I get nothing.
I am using the code below to get the text with Heading1 styling, and that works.
var heading1 = doc
    .GetChildNodes(NodeType.Paragraph, true)
    .Cast<Aspose.Words.Paragraph>()
    .ToArray()
    .Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1);
foreach (var head1 in heading1)
{
    listBox11.Items.Add(head1.GetText().ToString());
}
I am trying to use the code below to get the text with bullet styling and this does NOT work.
var bullets = doc
    .GetChildNodes(NodeType.Paragraph, true)
    .Cast<Aspose.Words.Paragraph>()
    .ToArray()
    .Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.ListBullet);
foreach (var bullet in bullets)
{
    listBox19.Items.Add(bullet.GetText().ToString());
}
I also tried the ListBullet 1, 2, 3, 4 and 5 style identifiers, but that does not fix the problem either.
Most likely your code does not work because the bullets are not applied via a style. In an MS Word document there are several levels at which formatting can be applied: document defaults, theme, style and direct formatting. In your case, I think the best way is to use the ListFormat.IsListItem property.
I am now using this to successfully extract the list items from a Word file and put them into a listbox.
string fileName = listBox1.Items.Cast<string>().FirstOrDefault();
// Open the document.
Document doc = new Document(fileName);
doc.UpdateListLabels();
NodeCollection paras = doc.GetChildNodes(NodeType.Paragraph, true);
// Find if we have the paragraph list. In our document, our list uses plain Arabic numbers,
// which start at three and end at six.
foreach (Aspose.Words.Paragraph paragraph in paras.OfType<Aspose.Words.Paragraph>().Where(p => p.ListFormat.IsListItem))
{
    //listBox19.Items.Add($"List item paragraph #{paras.IndexOf(paragraph)}");
    // This is the text we get when we output this node to text format.
    // This text output will omit list labels. Trim any paragraph formatting characters.
    string paragraphText = paragraph.ToString(SaveFormat.Text).Trim();
    // Remove the first two characters (the dot in front of the bullet text).
    string bullet = paragraphText.Remove(0, 2);
    listBox19.Items.Add(bullet);
    ListLabel label = paragraph.ListLabel;
}

{"'\u0004', hexadecimal value 0x04, is an invalid character

I am trying to convert a file that contains some special characters to XML format, but the conversion fails because of those special characters in the data.
I already have this regex code, but it is still not working for me. Please help.
The code I have tried:
string filedata = @"D:\readwrite\test11.txt";
string input = ReadForFile(filedata);
string re1 = @"[^\u0000-\u007F]+";
string re5 = @"\p{Cs}";
data = Regex.Replace(input, re1, "");
// Apply the second pattern to the already-filtered text, not to the raw input.
data = Regex.Replace(data, re5, "");
XmlDocument xmlDocument = new XmlDocument();
try
{
    xmlDocument = (XmlDocument)JsonConvert.DeserializeXmlNode(data);
    var Xdoc = XDocument.Parse(xmlDocument.OuterXml);
}
catch (Exception ex)
{
    Console.WriteLine(ex);
}
0x04 is a transmission control character and cannot appear in XML 1.0 text. XmlDocument is right to reject it if it really does appear in your data. This does suggest that the regex you have doesn't do what you think it does: [^\u0000-\u007F]+ only matches characters outside the ASCII range, and 0x04 lies inside that range, so it is never removed. The real question for me is why this non-text 'character' appears in data intended as XML in the first place.
I have other questions. I've never seen JsonConvert.DeserializeXmlNode before - I had to look up what it does. Why are you using a JSON function against the root of a document which presumably therefore contains no JSON? Why are you then taking that document, converting it back to a string, and then creating an XDocument from it? Why not just create an XDocument to start with?
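If the control characters cannot be removed at the source, one way to strip them is a pattern that targets the C0 controls XML 1.0 forbids (everything below 0x20 except tab, line feed and carriage return). This is a minimal sketch: XmlSanitizer and RemoveInvalidXmlChars are hypothetical names, and it deliberately leaves other invalid code points (such as unpaired surrogates) alone.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlSanitizer
{
    // C0 control characters that XML 1.0 does not allow.
    // Tab (0x09), LF (0x0A) and CR (0x0D) are legal and are kept.
    private static readonly Regex InvalidControlChars =
        new Regex(@"[\u0000-\u0008\u000B\u000C\u000E-\u001F]");

    // Hypothetical helper: returns the input with the forbidden controls removed.
    public static string RemoveInvalidXmlChars(string input)
    {
        return InvalidControlChars.Replace(input, string.Empty);
    }
}
```

Calling XmlSanitizer.RemoveInvalidXmlChars(input) before handing the text to the converter would drop 0x04 while leaving ordinary ASCII and non-ASCII text intact.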

replace new lines with "" in C#

I want to convert this:
<translation>
1 Sənədlər
</translation>
to
<translation>1 Sənədlər</translation> in XML using C#.
Please help me. Only translation tags.
I tried this:
XDocument xdoc = XDocument.Load(path);
xdoc.Save(path, SaveOptions.DisableFormatting);
But it does not remove the new lines between <translation> tags.
What you have should work. You can validate it by dumping the XDocument to a string variable to confirm whether SaveOptions is removing the formatting.
For example, I tried the code below, and content does not have any formatting, including newlines and whitespace.
XDocument xmlDoc = new XDocument(new XElement("Team", new XElement("Developer", "Sam")));
var content = xmlDoc.ToString(SaveOptions.DisableFormatting);
A new line is determined in the code by "\n" and possibly also "\r". You can simply remove these:
string xmlString = "<translation>\r\n1 Sənədlər\r\n</translation>"; // With the 'new lines'
xmlString = xmlString.Replace("\r", "").Replace("\n", "");
This will result in:
<translation>
1 Sənədlər
</translation>
Becoming:
<translation>1 Sənədlər</translation>
I hope this helps.
You can strip out newlines manually in an environment-sensitive way by using
var content = xmlString.Replace(Environment.NewLine, string.Empty);
XML defines two types of whitespace: significant and insignificant.
Insignificant whitespace is the whitespace between elements where text content doesn't occur, whereas significant whitespace is the whitespace within elements that contain text content. You might find the graphic in this article useful to show the difference.
What you have in your translation element is significant whitespace; the element contains text so it is assumed to be part of the element contents. Without a schema or DTD that says it can be collapsed, no amount of changing the whitespace handling on read or write is going to remove this. These options only relate to the insignificant whitespace.
What you can do is apply your own processing: using LINQ to XML, you can trim the whitespace of all elements that contain only text using something like this:
var textElements = doc.Descendants()
.Where(element => element.Nodes().All(node => node is XText));
foreach (var element in textElements)
{
element.Value = element.Value.Trim();
}
See this fiddle for a demo.
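Since the question asks for the translation tags only, the same idea can be narrowed to a single element name. A minimal sketch, assuming the file has no XML namespace on those elements; TranslationTrimmer and TrimTextElements are hypothetical names:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TranslationTrimmer
{
    // Hypothetical helper: trims leading/trailing whitespace inside every
    // text-only element with the given name and returns the unformatted XML.
    public static string TrimTextElements(string xml, string elementName)
    {
        XDocument doc = XDocument.Parse(xml);

        var textElements = doc.Descendants(elementName)
            // Any() leaves empty elements untouched; All() checks for text-only content.
            .Where(e => e.Nodes().Any() && e.Nodes().All(n => n is XText))
            .ToList();

        foreach (XElement element in textElements)
        {
            element.Value = element.Value.Trim();
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

For a file on disk, the same loop could run on a document loaded with XDocument.Load(path) and be followed by Save(path, SaveOptions.DisableFormatting).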

What would be the best way of checking whether a string contains XML tags?

I know that the following would find potential tags, but is there a better way to check if a string contains XML tags to prevent exceptions when reading/writing the string between XML files?
string testWord = "test<a>";
bool foundTag = Regex.IsMatch(testWord, @"^*<*>*$");
I'd use another Regex for that:
Regex.IsMatch(testWord, @"<.+?>");
However, even if it does match, there is no guarantee that your string actually is XML, as the regex could also match strings like "<<a>", which is invalid, or "a <= b >= c", which is obviously not XML.
You should consider using the XmlDocument class instead.
XmlDocument xmlDoc = new XmlDocument();
try
{
    // LoadXml parses a string; Load would treat the argument as a file name or URL.
    xmlDoc.LoadXml(testWord);
}
catch (XmlException)
{
    // not well-formed XML
}
Why don't you HtmlEncode the string before sending it via XML? This way you can avoid difficulties with Regex parsing tags.
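If the string is written through an XML API rather than concatenated by hand, the encoding happens automatically. A minimal sketch with LINQ to XML; XmlTextDemo, Wrap and the element name "word" are made up for illustration:

```csharp
using System;
using System.Xml.Linq;

public static class XmlTextDemo
{
    // Hypothetical helper: stores an arbitrary string as element text.
    // XElement escapes markup characters such as '<' and '&' when serializing.
    public static string Wrap(string value)
    {
        return new XElement("word", value).ToString();
    }

    // Reads the original string back from the serialized element.
    public static string Unwrap(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

With this approach a value such as "test<a>" survives a write/read round trip unchanged, so there is no need to detect tags up front.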

How to read a string containing XML elements without using the XML properties

I'm implementing XML reading in my project, where I have to read the contents of an XML file. I have achieved that.
Just out of curiosity, I also tried keeping the XML content inside a string and then reading only the values inside the element tags. I have achieved this as well. Below is my code.
string xml = @"<Login-Form>
<User-Authentication>
<username>Vikneshwar</username>
<password>xxx</password>
</User-Authentication>
<User-Info>
<firstname>Vikneshwar</firstname>
<lastname>S</lastname>
<email>xxx@xxx.com</email>
</User-Info>
</Login-Form>";
XDocument document = XDocument.Parse(xml);
// The element name must match the XML above.
var block = from file in document.Descendants("User-Authentication")
            select new
            {
                Username = file.Element("username").Value,
                Password = file.Element("password").Value,
            };
foreach (var file in block)
{
    Console.WriteLine(file.Username);
    Console.WriteLine(file.Password);
}
Similarly, I obtained my other set of elements (firstname, lastname, and email). Now my curiosity draws me again: can I do the same using only string functions?
The same string used in the above code is to be taken. I'm trying not to use any XML-related classes, that is, XDocument, XmlReader, etc. The same output should be achieved using only string functions. I'm not able to do that. Is it possible?
Don't do it. XML is more complex than it may appear, with complex rules surrounding nesting, character escaping, named entities, namespaces, ordering (attributes vs elements), comments, unparsed character data, and whitespace. For example, just add
<!--
<username>evil</username>
-->
Or
<parent xmlns="this:is-not/the/data/you/expected">
<username>evil</username>
</parent>
Or maybe the same in a CDATA section - and see how well basic string-based approaches work. Hint: you'll get a different answer to what you get via a DOM.
Using a dedicated tool designed for reading XML is the correct approach. At the minimum, use XmlReader - but frankly, a DOM (such as your existing code) is much more convenient. Alternatively, use a serializer such as XmlSerializer to populate an object model, and query that.
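The comment case above can be made concrete with a small comparison. The sketch below uses a shortened, made-up document and two hypothetical helpers, ViaStrings and ViaDom, to show a naive IndexOf/Substring lookup and a LINQ to XML lookup returning different values for the same input:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CommentPitfall
{
    // Made-up sample: a commented-out username precedes the real one.
    public const string Xml =
        "<User-Authentication>" +
        "<!-- <username>evil</username> -->" +
        "<username>Vikneshwar</username>" +
        "</User-Authentication>";

    // Naive string search: takes the first <username>...</username> pair it
    // sees, including one that sits inside a comment.
    public static string ViaStrings(string xml)
    {
        const string open = "<username>";
        const string close = "</username>";
        int start = xml.IndexOf(open, StringComparison.Ordinal) + open.Length;
        int end = xml.IndexOf(close, start, StringComparison.Ordinal);
        return xml.Substring(start, end - start);
    }

    // DOM lookup: comments are not elements, so only the real value is found.
    public static string ViaDom(string xml)
    {
        return XDocument.Parse(xml).Descendants("username").First().Value;
    }
}
```

Here ViaStrings(CommentPitfall.Xml) yields "evil" while ViaDom(CommentPitfall.Xml) yields "Vikneshwar".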
Trying to properly parse XML and XML-like data by hand does not end well; see "RegEx match open tags except XHTML self-contained tags".
You could use methods such as IndexOf, Equals and Substring, provided by the String class, to fulfill your needs.
Regex is an option worth considering too.
But it's advisable to use the XmlDocument class for this purpose.
It can be done without regular expressions, like this:
string[] elementNames = new string[] { "<username>", "<password>" };
foreach (string elementName in elementNames)
{
    int startingIndex = xml.IndexOf(elementName);
    string value = xml.Substring(startingIndex + elementName.Length,
        xml.IndexOf(elementName.Insert(1, "/"))
        - (startingIndex + elementName.Length));
    Console.WriteLine(value);
}
With a regular expression:
string[] elementNames2 = new string[] { "<username>", "<password>" };
foreach (string elementName in elementNames2)
{
    string value = Regex.Match(xml, String.Concat(elementName, "(.*)",
        elementName.Insert(1, "/"))).Groups[1].Value;
    Console.WriteLine(value);
}
Of course, the only recommended thing is to use the XML parsing classes.
Build an extension method that will get the text between tags like this:
public static class StringExtension
{
    public static string Between(this string content, string start, string end)
    {
        int startIndex = content.IndexOf(start) + start.Length;
        // Search for the end marker only after the start marker.
        int endIndex = content.IndexOf(end, startIndex);
        string result = content.Substring(startIndex, endIndex - startIndex);
        return result;
    }
}
