Include XML CDATA in an element - c#

UPDATE: Added more detail per request
I am trying to create an xml configuration file for my application. The file contains a list of criteria to search and replace in an html document. The problem is, I need to search for character strings like &nbsp;. I want my code to read the literal text itself, not the decoded character.
Admitting to being very new to XML, I did make some attempts at meeting the requirements. I read a load of links here on Stackoverflow regarding CDATA and ATTRIBUTES and so on, but the examples here (and elsewhere) seem to focus on creating one single line in an xml file, not multiple.
Here is one of many attempts I have made to no avail:
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE item [
<!ELEMENT item (id, replacewith)>
<!ELEMENT id (#CDATA)>
<!ELEMENT replacewith (#CDATA)>
]>
]>
<item id="&nbsp;" replacewith=" ">Non breaking space</item>
<item id="&#8209;" replacewith="-">Non breaking hyphen</item>
This document gives me a number of errors, including:
In the DOCTYPE, I get errors like <!ELEMENT id (#CDATA)>. In the CDATA area, Visual Studio informs me it is expecting a ',' or '|'.
]> gives me an error of invalid token at the root of the document.
And of course, after the second <item entry, I get an error stating XML document cannot contain multiple root level elements.
How can I write an xml file that includes multiple items and allows me to store and retrieve the text within the element, rather than the interpreted characters?
If it helps any, I am using .Net, C#, and Visual Studio.
EDIT:
The purpose of this xml file is to provide my code with a list of things to search and replace in an html file. The xml file simply contains a list of what to search for and what to replace with.
Here is the file I have in place right now:
<?xml version="1.0" encoding="utf-8" ?>
<Items>
<item id="&#8209;" replacewith="-">Non breaking hyphen</item>
<item id="&nbsp;" replacewith=" ">Non breaking hyphen</item>
</Items>
Using the first as an example, I want to read the literal text &#8209; but when I read the attribute I get the decoded character instead (a non-breaking hyphen), because that is what the code represents.
Any help or pointers you can give would be helpful.

To elaborate on my comment: XML treats certain characters as reserved, just as HTML does. An ampersand starts an entity or character reference, which any parser (browser, XML reader, etc.) translates into the character it stands for when the document is read.
The easiest way to make sure the values are read back as the literal text you want is to escape them as you would for the web. For example, to create your XML document, I did this:
XmlDocument xmlDoc = new XmlDocument();
XmlElement xmlItem;
XmlAttribute xmlAttr;
XmlText xmlText;
// Declaration
XmlDeclaration xmlDec = xmlDoc.CreateXmlDeclaration("1.0", "UTF-8", null);
XmlElement xmlRoot = xmlDoc.DocumentElement;
xmlDoc.InsertBefore(xmlDec, xmlRoot);
// Items
XmlElement xmlItems = xmlDoc.CreateElement(string.Empty, "Items", string.Empty);
xmlDoc.AppendChild(xmlItems);
// Item #1
xmlItem = xmlDoc.CreateElement(string.Empty, "item", string.Empty);
xmlAttr = xmlDoc.CreateAttribute(string.Empty, "id", string.Empty);
xmlAttr.Value = "&#8209;";
xmlItem.Attributes.Append(xmlAttr);
xmlAttr = xmlDoc.CreateAttribute(string.Empty, "replacewith", string.Empty);
xmlAttr.Value = "-";
xmlItem.Attributes.Append(xmlAttr);
xmlText = xmlDoc.CreateTextNode("Non breaking hyphen");
xmlItem.AppendChild(xmlText);
xmlItems.AppendChild(xmlItem);
// Item #2
xmlItem = xmlDoc.CreateElement(string.Empty, "item", string.Empty);
xmlAttr = xmlDoc.CreateAttribute(string.Empty, "id", string.Empty);
xmlAttr.Value = "&nbsp;";
xmlItem.Attributes.Append(xmlAttr);
xmlAttr = xmlDoc.CreateAttribute(string.Empty, "replacewith", string.Empty);
xmlAttr.Value = " ";
xmlItem.Attributes.Append(xmlAttr);
xmlText = xmlDoc.CreateTextNode("Non breaking hyphen");
xmlItem.AppendChild(xmlText);
xmlItems.AppendChild(xmlItem);
// For formatting
StringBuilder xmlBuilder = new StringBuilder();
XmlWriterSettings xmlSettings = new XmlWriterSettings
{
Indent = true,
IndentChars = " ",
NewLineChars = "\r\n",
NewLineHandling = NewLineHandling.Replace
};
using (XmlWriter writer = XmlWriter.Create(xmlBuilder, xmlSettings))
{
xmlDoc.Save(writer);
}
xmlOutput.Text = xmlBuilder.ToString();
Notice that I set the id values to the literal text you expect to read back. Now, look at how it gets encoded:
<?xml version="1.0" encoding="utf-16"?>
<Items>
<item id="&amp;#8209;" replacewith="-">Non breaking hyphen</item>
<item id="&amp;nbsp;" replacewith=" ">Non breaking hyphen</item>
</Items>
The only difference between yours and this one is that the ampersand was encoded as &amp; and the rest remained as a string literal. This is normal behavior for XML. When you read it back in, it will come back as the literal &#8209; and &nbsp;.
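To see the round trip from the reading side, here is a minimal sketch. The class and method names (EntityRoundTrip, ReadFirstId) are made up for illustration and are not part of the answer above; it only assumes the Items/item layout from the question.

```csharp
using System.Xml;

public static class EntityRoundTrip
{
    // Hypothetical helper: returns the id attribute of the first <item>
    // exactly as the parser hands it back.
    public static string ReadFirstId(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        var item = (XmlElement)doc.SelectSingleNode("/Items/item");
        return item.GetAttribute("id");
    }

    // Usage:
    //   ReadFirstId("<Items><item id=\"&amp;#8209;\" /></Items>")
    //     returns the seven characters &#8209; because the ampersand was escaped.
    //   ReadFirstId("<Items><item id=\"&#8209;\" /></Items>")
    //     returns the single character U+2011 because the reference was decoded.
}
```

So the search text has to be stored with &amp; in the file if the code is meant to see the ampersand form.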

Related

Accessing the xml tag with C# and updating the content

I have an xml file that I converted from pdf to xml.
Example XML looks as follows
<?xml version="1.0" encoding="UTF-8"?>
<office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0" xmlns:presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0" xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0" xmlns:math="http://www.w3.org/1998/Math/MathML" xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0" xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0" xmlns:config="urn:oasis:names:tc:opendocument:xmlns:config:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:ooow="http://openoffice.org/2004/writer" xmlns:oooc="http://openoffice.org/2004/calc" xmlns:dom="http://www.w3.org/2001/xml-events" xmlns:xforms="http://www.w3.org/2002/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:smil="urn:oasis:names:tc:opendocument:xmlns:smil-compatible:1.0" xmlns:anim="urn:oasis:names:tc:opendocument:xmlns:animation:1.0" xmlns:rpt="http://openoffice.org/2005/report" xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:grddl="http://www.w3.org/2003/g/data-view#" xmlns:officeooo="http://openoffice.org/2009/office" xmlns:tableooo="http://openoffice.org/2009/table" xmlns:drawooo="http://openoffice.org/2010/draw" 
xmlns:calcext="urn:org:documentfoundation:names:experimental:calc:xmlns:calcext:1.0" xmlns:loext="urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0" xmlns:field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0" xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0" xmlns:css3t="http://www.w3.org/TR/css3-text/" office:version="1.2" office:mimetype="application/vnd.oasis.opendocument.graphics">
<office:body>
<office:drawing>
<draw:page draw:name="page1" draw:style-name="dp2" draw:master-page-name="master-page3">
<draw:frame draw:style-name="gr9" draw:text-style-name="P10" draw:layer="layout" svg:width="1.242cm" svg:height="0.357cm" svg:x="17.055cm" svg:y="11.787cm">
<draw:text-box>
<text:p text:style-name="P2"><text:span text:style-name="T6">Example</text:span></text:p>
</draw:text-box>
</draw:frame>
</draw:page>
</office:drawing>
</office:body>
</office:document>
The C# code I'm trying to use:
XmlDocument doc = new XmlDocument();
XmlNamespaceManager namespaces = new XmlNamespaceManager(doc.NameTable);
namespaces.AddNamespace("xmlns:draw", "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0");
doc.Load("invoiceto.xml");
doc.SelectSingleNode("/draw:frame/draw:text-box/text:p/text:span", namespaces).InnerText = "new value";
I get this error
'' text 'namespace prefix not defined.'
I want to replace the text "Example" with C#, but how can I get to the <text:span text:style-name="T6"> tag with C#?
First of all, the prefix added to XmlNamespaceManager shouldn't include the xmlns part. You also need to add the prefix text besides draw, because both are used in the XPath expression passed to SelectSingleNode. Last, since <draw:frame> isn't the root element, you need to either specify the full path starting from the root or start the XPath with // (shorthand for the descendant-or-self axis) instead:
XmlNamespaceManager namespaces = new XmlNamespaceManager(doc.NameTable);
namespaces.AddNamespace("draw", "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0");
namespaces.AddNamespace("text", "urn:oasis:names:tc:opendocument:xmlns:text:1.0");
doc.SelectSingleNode("//draw:frame/draw:text-box/text:p/text:span", namespaces).InnerText = "new value";
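Putting those three points together, a self-contained sketch might look like the following. SpanEditor and ReplaceSpanText are hypothetical names, and the null check is an addition of mine so that a non-matching path does not throw a NullReferenceException.

```csharp
using System.Xml;

public static class SpanEditor
{
    // Hypothetical helper: sets the text of the first text:span inside a draw:frame
    // and returns the updated XML, or null when the path matches nothing.
    public static string ReplaceSpanText(string xml, string newValue)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // Register the prefixes used in the XPath, without the "xmlns:" part.
        var namespaces = new XmlNamespaceManager(doc.NameTable);
        namespaces.AddNamespace("draw", "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0");
        namespaces.AddNamespace("text", "urn:oasis:names:tc:opendocument:xmlns:text:1.0");

        XmlNode span = doc.SelectSingleNode("//draw:frame/draw:text-box/text:p/text:span", namespaces);
        if (span == null)
        {
            return null;
        }

        span.InnerText = newValue;
        return doc.OuterXml;
    }
}
```

The prefixes registered here only need to map to the same namespace URIs as the document; they do not have to match the prefixes the document itself uses.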

Add XElement to XML file with formatting and indenting

XML
Source XML
<!-- The comment -->
<Root xmlns="http://www.namespace.com">
<FirstElement>
</FirstElement>
<SecondElement>
</SecondElement>
</Root>
Desired XML
<!-- The comment -->
<Root xmlns="http://www.namespace.com">
<FirstElement>
</FirstElement>
<SecondElement>
</SecondElement>
<ThirdElement>
<FourthElement>thevalue</FourthElement>
</ThirdElement>
</Root>
Now my output XML is
<!-- The comment -->
<Root xmlns="http://www.namespace.com">
<FirstElement>
</FirstElement>
<SecondElement>
</SecondElement><ThirdElement><FourthElement>thevalue</FourthElement></ThirdElement>
</Root>
Note that I need to load the XML with LoadOptions.PreserveWhitespace as I need to preserve all whitespaces (desired by customer).
The desired output is to put 2 newlines after the last child element of the "root" and add with the proper indent
<ThirdElement>
<FourthElement>thevalue</FourthElement>
</ThirdElement>
Any ideas how to realize this?
Code
var xDoc = XDocument.Load(sourceXml, LoadOptions.PreserveWhitespace); //need to preserve all whitespaces
var mgr = new XmlNamespaceManager(new NameTable());
var ns = xDoc.Root.GetDefaultNamespace();
mgr.AddNamespace("ns", ns.NamespaceName);
if (xDoc.Root.HasElements)
{
xDoc.Root.Elements().Last().AddAfterSelf(new XElement(ns + "ThirdElement", new XElement(ns + "FourthElement", "thevalue")));
using (var xw = XmlWriter.Create(outputXml, new XmlWriterSettings() { OmitXmlDeclaration = true })) //omit xml declaration
xDoc.Save(xw);
}
Ideally, you should explain to your client that this really isn't important.
However, if you really need to mess around with whitespace, I'd note that XText is what you need. This is another XObject that represents text nodes and can be interspersed with your content. It is probably a much better approach than string manipulation.
For example:
doc.Root.Add(
new XText("\n\t"),
new XElement(ns + "ThirdElement",
new XText("\n\t\t"),
new XElement(ns + "FourthElement", "thevalue"),
new XText("\n\t")),
new XText("\n"));
My solution is to beautify just before saving by reparsing the document.
string content = XDocument.Parse(xDoc.ToString()).ToString();
File.WriteAllText(file, content, Encoding.UTF8);

How to get data from an XML File in C# using XMLDocument class?

Good Evening All, and happy weekend!.
I have been trying all day to understand how to parse my simple XML file so I can understand it enough to write a personal project I want to work on.
I have been reading articles on this site and others but cannot get past where I am :(
My XML Document is ...
<XML>
<User>
<ID>123456789</ID>
<Device>My PC</Device>
</User>
<History>
<CreationTime>27 June 2013</CreationTime>
<UpdatedTime>29 June 2013</UpdatedTime>
<LastUsage>30 June 2013</LastUsage>
<UsageCount>103</UsageCount>
</History>
<Configuration>
<Name>Test Item</Name>
<Details>READ ME</Details>
<Enabled>true</Enabled>
</Configuration>
</XML>
I am trying to get the value in the details element (READ ME). Below is my code
// Start Logging Progress
Console.WriteLine("Test Application - XML Parsing and Creating");
Console.ReadKey();
// Load XML Document
XmlDocument MyDoc = new XmlDocument(); MyDoc.Load(@"E:\MyXML.XML");
// Select Node
XmlNode MyNode = MyDoc.SelectSingleNode("XML/Configuration/Details");
// Output Node Value
Console.WriteLine(String.Concat("Details: ", MyNode.Value));
// Pause
Console.ReadKey();
My console application runs and outputs "Details: " but does not give me the text within the element.
Can somebody see why this is happening, and perhaps give me advice if I am completely off the wheel? I have no previous knowledge in reading XML files; hence where I am now :)
Thanks! Tom
With your XPath expression
// Select Node
XmlNode MyNode = MyDoc.SelectSingleNode("XML/Configuration/Details");
you are selecting an element, so the type of MyNode will be XmlElement. But the Value of an XmlElement is always null (see on MSDN), so you need to use XmlElement.InnerText or XmlElement.InnerXml instead.
So change your code to
// Output Node Value
Console.WriteLine(String.Concat("Details: ", MyNode.InnerText));
Alternatively, you can select the content of the element with the XPath text() node test; in this case MyNode will be an XmlText, whose content you read with Value:
// Select Node
XmlNode MyNode = MyDoc.SelectSingleNode("XML/Configuration/Details/text()");
// Output Node Value
Console.WriteLine(String.Concat("Details: ", MyNode.Value));
As a side note, if you are learning XML manipulation in C# anyway, you should check out LINQ to XML, which is another, newer way of working with XML in C#.
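For comparison, a rough LINQ to XML version of the same lookup could look like this. LinqToXmlExample and ReadDetails are illustrative names only; the sketch assumes the document layout from the question and returns null when Configuration or Details is missing.

```csharp
using System.Xml.Linq;

public static class LinqToXmlExample
{
    // Hypothetical helper: returns the text of Configuration/Details under the root,
    // or null if either element is absent.
    public static string ReadDetails(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        XElement details = doc.Root?.Element("Configuration")?.Element("Details");

        // The explicit XElement-to-string conversion yields null for a null element.
        return (string)details;
    }
}
```

To read from the file in the question instead of a string, XDocument.Load(path) can be used in place of XDocument.Parse.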
Just for interest, a little-known "simple" syntax is this:
XmlDocument myDoc = new XmlDocument();
myDoc.Load(@"D:\MyXML.XML");
string details = myDoc["XML"]["Configuration"]["Details"].InnerText;
Note that this (and the XPath approach) could go pop if your XML doesn't conform to the structure you're expecting, so you'd ideally put some validation in there as well.
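One simple form of that validation is a null check on the selected node. The sketch below uses hypothetical names (SafeLookup, ReadDetailsOrDefault) and assumes the caller is happy with a fallback string when the element is missing.

```csharp
using System.Xml;

public static class SafeLookup
{
    // Hypothetical helper: returns the Details text, or the fallback when the
    // document does not contain XML/Configuration/Details.
    public static string ReadDetailsOrDefault(XmlDocument doc, string fallback)
    {
        // SelectSingleNode returns null when nothing matches, so check before
        // touching InnerText.
        XmlNode node = doc.SelectSingleNode("XML/Configuration/Details");
        return node != null ? node.InnerText : fallback;
    }
}
```

The indexer syntax above needs the same care: myDoc["XML"] returns null when there is no such child, so chaining further indexers on it would throw.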
You can use the XPath library for that (you must include "System.Xml.XPath"):
XmlDocument document = new XmlDocument();
document.Load("MyXml.xml");
XPathNavigator navigator = document.CreateNavigator();
foreach (XPathNavigator nav in navigator.Select("//Details"))
{
Console.WriteLine(nav.Value);
}
The above code iterates over every node called Details, extracts its value and prints it.
If you want to retrieve a particular value from an XML file:
XmlDocument _LocalInfo_Xml = new XmlDocument();
_LocalInfo_Xml.Load(fileName);
XmlElement _XmlElement;
_XmlElement = _LocalInfo_Xml.GetElementsByTagName("UserId")[0] as XmlElement;
string Value = _XmlElement.InnerText;
Value now contains the element's text.

Need advice on removing xml row

I'm using a CMS and found a function to generate an RSS feed from content within folders. However, I would like one of the rows removed from the list. I've done my research and I think I should be using the XmlDocument class to remove the row I don't want. I've used Firebug and FirePath to get the XPath, but I can't figure out how to apply it appropriately. I am also uncertain whether I should be using .Load or .LoadXml; I've used the latter, seeing as the feed displays fine. However, I have had to call ToString() to get rid of the overloaded-method match error.
The row I want removed is called "Archived Planes".
The XPath I get from FirePath is ".//*[@id='feedContent']/xhtml:div[11]/xhtml:h3/xhtml:a"
I am also assuming that .RemoveChild(node); will remove it from rssData before I call Response.Write. Thanks
Object rssData = new object();
Cms.UI.CommonUI.ApplicationAPI AppAPI = new Cms.UI.CommonUI.ApplicationAPI();
rssData = AppAPI.ecmRssSummary(50, true, "DateCreated", 0, "");
Response.ContentType = "text/xml";
XmlDocument xmlDocument = new XmlDocument();
xmlDocument.LoadXml(rssData.ToString());
XmlNode node = xmlDocument.SelectSingleNode(@"xhtml:div/xhtml:h3/xhtml[a = 'Archived Planes']");
if (node != null)
{
node.ParentNode.RemoveChild(node);
}
Response.Write(rssData);
Edited to include output below
This is what the Response.Write of rssData outputs:
<?xml version="1.0" ?>
<rss xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0">
<channel>
<title>Plane feed</title>
<link>http://www.domain.rss1.aspx</link>
<description></description>
<item>
<title>New Planes</title>
<link>http://www.domainx1.aspx</link>
<description>
This is the description
</description>
<author>Andrew</author>
<pubDate>Thu, 16 Aug 2012 15:55:53 GMT</pubDate>
</item>
<item>
<title>Archived Planes</title>
<link>http://www.domain23.aspx</link>
<description>
Description of Archived Planes
</description>
<author>Jan</author>
<pubDate>Wed, 15 Aug 2012 10:34:23 GMT</pubDate>
</item>
</channel>
</rss>
I suspect your xpath is incorrect, it looks like some funky dom element that you are referencing and not the xml element... e.g. for the following xml
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
<NewDataSet>
<userinfo>
<username>pqr2</username>
<pass>abc</pass>
<addr>abc</addr>
</userinfo>
<userinfo>
<username>pqr1</username>
<pass>pqr2</pass>
<addr>pqr3</addr>
</userinfo>
</NewDataSet>
This code will remove the userinfo node with a username element of pqr1
XmlDocument xmlDocument = new XmlDocument();
xmlDocument.Load(@"file.xml");
XmlNode node = xmlDocument.SelectSingleNode(@"NewDataSet/userinfo[username = 'pqr1']");
if (node != null) {
node.ParentNode.RemoveChild(node);
xmlDocument.Save(@"file.xml");
}
Thought I would post my final code, although I'll mark Paul's as the answer, as his code/advice was the basis of this and my further research. I still don't know what the '@' in SelectSingleNode is and whether I should really have it; I will do more research.
Object rssData = new object();
Cms.UI.CommonUI.ApplicationAPI AppAPI = new Cms.UI.CommonUI.ApplicationAPI();
rssData = AppAPI.ecmRssSummary(50, true, "DateCreated", 0, "");
Response.ContentType = "text/xml";
Response.ContentEncoding = System.Text.Encoding.UTF8;
XmlDocument xmlDocument = new XmlDocument();
xmlDocument.LoadXml(rssData.ToString());
XmlNode node = xmlDocument.SelectSingleNode("rss/channel/item[title = 'Archived Planes']");
if (node != null)
{
try
{
node.ParentNode.RemoveChild(node);
xmlDocument.Save(Response.Output);
}
catch { } // errors are silently ignored here
}
else
{
Response.Write(rssData);
}
The @ symbol simply denotes a verbatim string literal (it lets you write characters such as backslashes without escaping them, compared to a regular string literal), e.g.
string e = "Joe said \"Hello\" to me"; // Joe said "Hello" to me
string f = @"Joe said ""Hello"" to me"; // Joe said "Hello" to me

How to correctly parse an XML document with arbitrary namespaces

I am trying to parse somewhat standard XML documents that use a schema called MARCXML from various sources.
Here are the first few lines of an example XML file that needs to be handled...
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
<marc:record>
<marc:leader>00925njm 22002777a 4500</marc:leader>
and one without namespace prefixes...
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
<leader>01142cam 2200301 a 4500</leader>
Key point: in order to get the XPaths to resolve further along in the program I have to go through a regex routine to add the namespaces to the NameTable (which doesn't add them by default). This seems unnecessary to me.
Regex xmlNamespace = new Regex("xmlns:(?<PREFIX>[^=]+)=\"(?<URI>[^\"]+)\"", RegexOptions.Compiled);
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xmlRecord);
XmlNamespaceManager nsMgr = new XmlNamespaceManager(xmlDoc.NameTable);
MatchCollection namespaces = xmlNamespace.Matches(xmlRecord);
foreach (Match n in namespaces)
{
nsMgr.AddNamespace(n.Groups["PREFIX"].ToString(), n.Groups["URI"].ToString());
}
The XPath call looks something like this...
XmlNode leaderNode = xmlDoc.SelectSingleNode(".//" + LeaderNode, nsMgr);
Where LeaderNode is a configurable value and would equal "marc:leader" in the first example and "leader" in the second example.
Is there a better, more efficient way to do this? Note: suggestions for solving this using LINQ are welcome, but I would mainly like to know how to solve this using XmlDocument.
EDIT: I took GrayWizardx's advice and now have the following code...
if (LeaderNode.Contains(":"))
{
string prefix = LeaderNode.Substring(0, LeaderNode.IndexOf(':'));
XmlNode root = xmlDoc.DocumentElement; // FirstChild would be the XML declaration when one is present
string nameSpace = root.GetNamespaceOfPrefix(prefix);
nsMgr.AddNamespace(prefix, nameSpace);
}
Now there's no more dependency on Regex!
If you know there is going to be a given element in the document (for instance the root element) you could try using GetNamespaceOfPrefix.
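As a rough sketch of that suggestion, the prefix can be resolved from the root element and registered just before the XPath call. PrefixResolver and SelectByQualifiedName are hypothetical names, and the sketch assumes the prefix is declared on the root element, as in the first MARCXML sample.

```csharp
using System.Xml;

public static class PrefixResolver
{
    // Hypothetical helper: selects the first descendant matching a possibly
    // prefixed name, looking the prefix up on the document's root element.
    public static XmlNode SelectByQualifiedName(XmlDocument doc, string qualifiedName)
    {
        var nsMgr = new XmlNamespaceManager(doc.NameTable);

        int colon = qualifiedName.IndexOf(':');
        if (colon > 0)
        {
            string prefix = qualifiedName.Substring(0, colon);
            string uri = doc.DocumentElement.GetNamespaceOfPrefix(prefix);
            if (string.IsNullOrEmpty(uri))
            {
                return null; // prefix is not declared on the root element
            }
            nsMgr.AddNamespace(prefix, uri);
        }

        return doc.SelectSingleNode(".//" + qualifiedName, nsMgr);
    }
}
```

One caveat: in XPath 1.0 an unprefixed name only matches elements in no namespace, so a plain "leader" will not match the second sample, where the elements sit in a default namespace. For that case a prefix of your own choosing would still have to be mapped to the namespace URI and used in the expression.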
