Decode CDATA section in C# - c#

I have a bit of XML as follows:
<section>
<description>
<![CDATA[
This is a "description"
that I have formatted
]]>
</description>
</section>
I'm accessing it using curXmlNode.SelectSingleNode("description").InnerText but the value returns \r\n This is a "description"\r\n that I have formatted instead of This is a "description" that I have formatted.
Is there a simple way to get that sort of output from a CDATA section? Leaving the actual CDATA tag out seems to have it return the same way.

You can use Linq to read CDATA.
XDocument xdoc = XDocument.Load("YourXml.xml");
xDoc.DescendantNodes().OfType<XCData>().Count();
It's very easy to get the Value this way.
Here's a good overview on MSDN: http://msdn.microsoft.com/en-us/library/bb308960.aspx
for .NET 2.0, you probably just have to pass it through Regex:
string xml = #"<section>
<description>
<![CDATA[
This is a ""description""
that I have formatted
]]>
</description>
</section>";
XPathDocument xDoc = new XPathDocument(new StringReader(xml.Trim()));
XPathNavigator nav = xDoc.CreateNavigator();
XPathNavigator descriptionNode =
nav.SelectSingleNode("/section/description");
string desiredValue =
Regex.Replace(descriptionNode.Value
.Replace(Environment.NewLine, String.Empty)
.Trim(),
#"\s+", " ");
that trims your node value, replaces newlines with empty, and replaces 1+ whitespaces with one space. I don't think there's any other way to do it, considering the CDATA is returning significant whitespace.

I think the best way is...
XmlCDataSection cDataNode = (XmlCDataSection)(doc.SelectSingleNode("section/description").ChildNodes[0]);
string finalData = cDataNode.Data;

Actually i think is pretty much simple. the CDATA section it will be loaded in the XmlDocument like another XmlNode the difference is that this node is going to has the property NodeType = CDATA, wich it mean if you have the XmlNode node = doc.SelectSingleNode("section/description"); that node will have a ChildNode with the InnerText property filled the pure data, and there is you want to remove the especial characters just use Trim() and you will have the data.
The code will look like
XmlNode cDataNode = doc.SelectSingleNode("section/description").ChildNodes[0];
string finalData = cDataNode.InnerText.Trim();
Thanks
XOnDaRocks

A simpler form of #Franky's solution:
doc.SelectSingleNode("section/description").FirstChild.Value
The Value property is equivalent to the Data property of the casted XmlCDataSection type.

CDATA blocks are effectively verbatim. Any whitespace inside CDATA is significant, by definition, according to XML spec. Therefore, you get that whitespace when you retrieve the node value. If you want to strip it using your own rules (since XML spec doesn't specify any standard way of stripping whitespace in CDATA), you have to do it yourself, using String.Replace, Regex.Replace etc as needed.

Related

C# Xml Encoding

I'm freaking out with C# and XmlDocuments right now.
I need to parse XML data into another XML but I can't get special characters to work.
I'm working with XmlDocument and XmlNode.
What I tried so far:
- XmlDocument.CreateXmlDeclaration("1.0", "UTF-8", "yes");
- XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
What I know for sure:
- The input XML is also UTF-8
- The "InnerText" value is encoded without replacing the characters
Here is some code (not all... way to much code):
XmlDocument newXml = new XmlDocument();
newXml = (XmlDocument)systemsTemplate.Clone();
newXml.CreateXmlDeclaration("1.0", "UTF-8", "yes");
newXml.SelectSingleNode("systems").RemoveAll();
foreach(XmlNode categories in exSystems.SelectNodes("root/Content/Systems/SystemLine"))
{
XmlNode categorieSystemNode = systemsTemplate.SelectSingleNode("systems/system").Clone();
categorieSystemNode.RemoveAll();
XmlNode importIdNode = systemsTemplate.SelectSingleNode("systems/system/import_id").Clone();
string import_id = categories.Attributes["nodeName"].Value;
importIdNode.InnerText = import_id;
categorieSystemNode.AppendChild(importIdNode);
[way more Nodes which I proceed like this]
}
newXml.SelectSingleNode("systems").AppendChild(newXml.ImportNode(categorieSystemNode, true));
XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
writer.Formatting = Formatting.Indented;
newXml.Save(writer);
writer.Flush();
writer.Close();
But what I get is this as an example:
<intro><p>Whether your project [...]</intro>
Instead of this:
<intro><p>Whether your project [...] </p></intro>
I do have other non-html tags in the XML so please don't provide HTML-parsing solutions :/
I know I could replace the characters with String.Replace() but that's dirty and unsafe (and slow with around 20K lines).
I hope there is a simpler way of doing this.
Kind regards,
Eriwas
The main propose of XmlDocument is to provide an easy way to work with XML documents while making sure the outcome is a well formed document.
So, using InnerText as in your example, you let the framework encode the string and properly insert it into that document. Whenever you read that same value, it will be decoded and returned to you exactly as your original string.
But, if you want to add an XML fragment anyways, you should stick with InnerXml or ImportNode. You must be aware that could lead to a more complex document structure, and you probably would like to avoid that.
As a third possibility, you can use the CreateCDataSection to add a CDATA and add your text there.
You definitely should be away from treating that XML document as a string by trying Replace things; stick with the framework and you'll be ok.

Read a space character from an XML element

A surprisingly simple question this time! :-) There's an XML file like this:
<xml>
<data> </data>
</xml>
Now I need to read exactly whatever is in the <data> element. Be it a single whitespace like U+0020. My naive guess:
XmlDocument xd = new XmlDocument();
xd.Load(fileName);
XmlNode xn = xd.DocumentElement.SelectSingleNode("data");
string data = xn.InnerText;
But that returns an empty string. The white space got lost. Any other data can be read just fine.
What do I need to do to get my space character here?
After browsing the web for a while, I tried reading the XML file with an XmlReader that lets me set XmlReaderSettings.IgnoreWhitespace = false but that didn't help.
You must use xml:space="preserve" in your XML, according to the W3C standards and the MSDN docs.
The W3C standards dictate that white space be handled differently
depending on where in the document it occurs, and depending on the
setting of the xml:space attribute. If the characters occur within the
mixed element content or inside the scope of the xml:space="preserve",
they must be preserved and passed without modification to the
application. Any other white space does not need to be preserved. The
XmlTextReader only preserves white space that occurs within an
xml:space="preserve" context.
XmlDocument xd = new XmlDocument();
xd.LoadXml(#"<xml xml:space=""preserve""><data> </data></xml>");
XmlNode xn = xd.DocumentElement.SelectSingleNode("data");
string data = xn.InnerText; // data == " "
Console.WriteLine(data == " "); //True
Tested HERE.

Use ToString or Value to get string value of an xml doc

If I have a System.Xml.XmlDocument, and I return the value of it, would I get the literal string value of the entire doc?
Or do I need to use ToString for that?
You really should start by reading the documentation of XmlDocument.
It will show you that you can use the OuterXml property.
Gets the markup representing this node and all its child nodes.
If you have -
XmlDocument doc;
that contains valid XML,
you can get its XML string using -
XmlNode root = doc.FirstChild;
Console.WriteLine(root.FirstChild.OuterXml);

parsing XML with ampersand

I have a string which contains XML, I just want to parse it into Xelement, but it has an ampersand. I still have a problem parseing it with HtmlDecode. Any suggestions?
string test = " <MyXML><SubXML><XmlEntry Element="test" value="wow&" /></SubXML></MyXML>";
XElement.Parse(HttpUtility.HtmlDecode(test));
I also added these methods to replace those characters, but I am still getting XMLException.
string encodedXml = test.Replace("&", "&").Replace("<", "<").Replace(">", ">").Replace("\"", """).Replace("'", "&apos;");
XElement myXML = XElement.Parse(encodedXml);
t
or Even tried it with this:
string newContent= SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);
Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.
For example, "wow&".Replace("&", "&") results in wow&amp; which is clearly undesirable.
Regex.Replace can give you more control to avoid this scenario, and can be written to only match "&" symbols that are not part of other characters, such as <, something like:
string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&");
The above works, but admittedly it doesn't cover the variety of other characters that start with an ampersand, such as and the list can grow.
A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&" the decode process would return "&wow&" then re-encoding it would return "&wow&", which is desirable. To pull this off you could use this:
string result = Regex.Replace(test, #"value=\""(.*?)\""", m => "value=\"" +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
"\"");
var doc = XElement.Parse(result);
Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.
EDIT: updated solution that should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Attempting to manipulate XML/HTML tags with regex is not favorable as it can be error prone and over-complicated. Your case is somewhat special since you need to sanitize it first in order to make use of it.
string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
m.Groups["start"].Value +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
m.Groups["end"].Value);
var doc = XElement.Parse(result);
Your string doesn't contain valid XML, that's the issue. You need to change your string to:
<MyXML><SubXML><XmlEntry Element="test" value="wow&" /></SubXML></MyXML>"
HtmlEncode will not do the trick, it will probably create even more ampersands (for instance, a ' might become ", which is an Xml entity reference, which are the following:
& &
&apos; '
" "
< <
> >
But it might you get things like &nbsp, which is fine in html, but not in Xml. Therefore, like everybody else said, correct the xml first by making sure any character that is NOT PART OF THE ACTUAL MARKUP OF YOUR XML (that is to say, anything INSIDE your xml as a variable or text) and that occurs in the entity reference list is translated to their corresponding entity (so < would become <). If the text containing the illegal character is text inside an xml node, you could take the easy way and surround the text with a CDATA element, this won't work for attributes though.
Filip's answer is on the right track, but you can hijack the System.Xml.XmlDocument class to do this for you without an entire new utility function.
XmlDocument doc = new XmlDocument();
string xmlEscapedString = (doc.CreateTextNode("Unescaped '&' containing string that would have broken your xml")).OuterXml;
The ampersant makes the XML invalid. This cannot be fixed by a stylesheet so you need to write code with some other tool or code in VB/C#/PHP/Delphi/Lisp/Etc. to remove it or to translate it to &.
This is the simplest and best approach. Works with all characters and allows to parse XML for any web service call i.e. SharePoint ASMX.
public string XmlEscape(string unescaped)
{
XmlDocument doc = new XmlDocument();
var node = doc.CreateElement("root");
node.InnerText = unescaped;
return node.InnerXml;
}
If your string is not valid XML, it will not parse. If it contains an ampersand on its own, it's not valid XML. Contrary to HTML, XML is very strict.
You should 'encode' rather than decode. But calling HttpUtility.HtmlEncode will not help you as it will encode your '<' and '>' symbols as well and your string will no longer be an XML.
I think that for this case the best solution would be to replace '&' with '& amp;' (with no space)
Perhaps consider writing your own XMLDocumentScanner. That's what NekoHTML is doing to have the ability to ignore ampersands not used as entity references.

How do I edit XML in C# without changing format/spacing?

I need an application that goes through an xml file, changes some attribute values and adds other attributes. I know I can do this with XmlDocument and XmlWriter. However, I don't want to change the spacing of the document. Is there any way to do this? Or, will I have to parse the file myself?
XmlDocument has a property PreserveWhitespace. If you set this to true insignificant whitespace will be preserved.
See MSDN
EDIT
If I execute the following code, whitespace including line breaks is preserved. (It's true that a space is inserted between <b and />)
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.LoadXml(
#"<a>
<b/>
</a>");
Console.WriteLine(doc.InnerXml);
The output is:
<a>
<b />
</a>
Insignificant whitespace will typically be thrown away or reformatted. So unless the XML file uses the xml:space="preserve" attribute on the nodes which shall preserve their exact whitespace, changing whitespace is OK per XML specifications.

Categories

Resources