Read a space character from an XML element - c#

A surprisingly simple question this time! :-) There's an XML file like this:
<xml>
<data> </data>
</xml>
Now I need to read exactly whatever is in the <data> element. Be it a single whitespace like U+0020. My naive guess:
XmlDocument xd = new XmlDocument();
xd.Load(fileName);
XmlNode xn = xd.DocumentElement.SelectSingleNode("data");
string data = xn.InnerText;
But that returns an empty string. The white space got lost. Any other data can be read just fine.
What do I need to do to get my space character here?
After browsing the web for a while, I tried reading the XML file with an XmlReader that lets me set XmlReaderSettings.IgnoreWhitespace = false but that didn't help.

You must use xml:space="preserve" in your XML, according to the W3C standards and the MSDN docs.
The W3C standards dictate that white space be handled differently
depending on where in the document it occurs, and depending on the
setting of the xml:space attribute. If the characters occur within the
mixed element content or inside the scope of the xml:space="preserve",
they must be preserved and passed without modification to the
application. Any other white space does not need to be preserved. The
XmlTextReader only preserves white space that occurs within an
xml:space="preserve" context.
XmlDocument xd = new XmlDocument();
xd.LoadXml(#"<xml xml:space=""preserve""><data> </data></xml>");
XmlNode xn = xd.DocumentElement.SelectSingleNode("data");
string data = xn.InnerText; // data == " "
Console.WriteLine(data == " "); //True
Tested HERE.

Related

C# Xml Encoding

I'm freaking out with C# and XmlDocuments right now.
I need to parse XML data into another XML but I can't get special characters to work.
I'm working with XmlDocument and XmlNode.
What I tried so far:
- XmlDocument.CreateXmlDeclaration("1.0", "UTF-8", "yes");
- XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
What I know for sure:
- The input XML is also UTF-8
- The "InnerText" value is encoded without replacing the characters
Here is some code (not all... way to much code):
XmlDocument newXml = new XmlDocument();
newXml = (XmlDocument)systemsTemplate.Clone();
newXml.CreateXmlDeclaration("1.0", "UTF-8", "yes");
newXml.SelectSingleNode("systems").RemoveAll();
foreach(XmlNode categories in exSystems.SelectNodes("root/Content/Systems/SystemLine"))
{
XmlNode categorieSystemNode = systemsTemplate.SelectSingleNode("systems/system").Clone();
categorieSystemNode.RemoveAll();
XmlNode importIdNode = systemsTemplate.SelectSingleNode("systems/system/import_id").Clone();
string import_id = categories.Attributes["nodeName"].Value;
importIdNode.InnerText = import_id;
categorieSystemNode.AppendChild(importIdNode);
[way more Nodes which I proceed like this]
}
newXml.SelectSingleNode("systems").AppendChild(newXml.ImportNode(categorieSystemNode, true));
XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
writer.Formatting = Formatting.Indented;
newXml.Save(writer);
writer.Flush();
writer.Close();
But what I get is this as an example:
<intro><p>Whether your project [...]</intro>
Instead of this:
<intro><p>Whether your project [...] </p></intro>
I do have other non-html tags in the XML so please don't provide HTML-parsing solutions :/
I know I could replace the characters with String.Replace() but that's dirty and unsafe (and slow with around 20K lines).
I hope there is a simpler way of doing this.
Kind regards,
Eriwas
The main propose of XmlDocument is to provide an easy way to work with XML documents while making sure the outcome is a well formed document.
So, using InnerText as in your example, you let the framework encode the string and properly insert it into that document. Whenever you read that same value, it will be decoded and returned to you exactly as your original string.
But, if you want to add an XML fragment anyways, you should stick with InnerXml or ImportNode. You must be aware that could lead to a more complex document structure, and you probably would like to avoid that.
As a third possibility, you can use the CreateCDataSection to add a CDATA and add your text there.
You definitely should be away from treating that XML document as a string by trying Replace things; stick with the framework and you'll be ok.

How to read xml string ignoring header?

I want to read a xml string ignoring the header and the comments.
To ignore the comments it's simples and I found a solution here.
But I'm not finding any solution to ignore the header.
Let me give an example:
Consider this xml:
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- Some comments -->
<Tag Attribute="3">
...
</Tag>
I want to read the xml to a string obtaining just the element "Tag" and others elements but withou the "xml version" and the comments.
The element "Tag" is only an example. Could exist many others.
So, I want only this:
<Tag Attribute="3">
...
</Tag>
The code that I've come so far:
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreComments = true;
XmlReader reader = XmlReader.Create("...", settings);
xmlDoc.Load(reader);
And I'm not finding anything on XmlReaderSettings to do that.
Do I need to go node by node choosing only the ones I want? This setting does not exist?
EDIT 1:
Just to resume my problem. I need the contents of the xml to use in a CDATA of a WebService. When I'm sending comments or xml version, I'm getting an specific error of that part of xml. So I assume that when I read the xml without the version, header and comments I'll be good to go.
Here's a really simple solution.
using (var reader = XmlReader.Create(/*reader, stream, etc.*/)
{
reader.MoveToContent();
string content = reader.ReadOuterXml();
}
Well, it seems that there is no settings to ignore declaration, so I had to ignore it myself.
Here's the code I've written for those who might be interested:
private string _GetXmlWithoutHeadersAndComments(XmlDocument doc)
{
string xml = null;
// Loop through the child nodes and consider all but comments and declaration
if (doc.HasChildNodes)
{
StringBuilder builder = new StringBuilder();
foreach (XmlNode node in doc.ChildNodes)
if (node.NodeType != XmlNodeType.XmlDeclaration && node.NodeType != XmlNodeType.Comment)
builder.Append(node.OuterXml);
xml = builder.ToString();
}
return xml;
}
If you want to only get the Tag elements, you should just read the XML as normal, then find them using the XmlDocument's XPath capabilities.
For your xmlDoc object:
var nodes = xmlDoc.DocumentElement.SelectNodes("Tag");
You can then iterate through these like so:
foreach (XmlNode node in nodes) { }
Or, obviously, you could just put your SelectNodes query into the foreach loop, if you're never going to reuse the nodes object.
This will return all Tag elements within your XML document, and you can do whatever you see fit with them.
There's no need to ever encounter comments while using XmlDocument if you don't want to, and you're not going to end up getting results including either the header or the comments. Is there a particular reason you're trying to remove pieces of the XML before you begin parsing it?
Edit: Based on your edit, it seems like you're having a problem with the header giving an error when you try to pass it. You probably shouldn't straight-up remove the header, so your best option might be to change the header to one that you know works. You can change the header (declaration) like so:
XmlDeclaration xmlDeclaration;
xmlDeclaration = yourDocument.CreateXmlDeclaration(
yourVersion,
yourEncoding,
isStandalone);
yourDocument.ReplaceChild(xmlDeclaration, doc.FirstChild);

Decode CDATA section in C#

I have a bit of XML as follows:
<section>
<description>
<![CDATA[
This is a "description"
that I have formatted
]]>
</description>
</section>
I'm accessing it using curXmlNode.SelectSingleNode("description").InnerText but the value returns \r\n This is a "description"\r\n that I have formatted instead of This is a "description" that I have formatted.
Is there a simple way to get that sort of output from a CDATA section? Leaving the actual CDATA tag out seems to have it return the same way.
You can use Linq to read CDATA.
XDocument xdoc = XDocument.Load("YourXml.xml");
xDoc.DescendantNodes().OfType<XCData>().Count();
It's very easy to get the Value this way.
Here's a good overview on MSDN: http://msdn.microsoft.com/en-us/library/bb308960.aspx
for .NET 2.0, you probably just have to pass it through Regex:
string xml = #"<section>
<description>
<![CDATA[
This is a ""description""
that I have formatted
]]>
</description>
</section>";
XPathDocument xDoc = new XPathDocument(new StringReader(xml.Trim()));
XPathNavigator nav = xDoc.CreateNavigator();
XPathNavigator descriptionNode =
nav.SelectSingleNode("/section/description");
string desiredValue =
Regex.Replace(descriptionNode.Value
.Replace(Environment.NewLine, String.Empty)
.Trim(),
#"\s+", " ");
that trims your node value, replaces newlines with empty, and replaces 1+ whitespaces with one space. I don't think there's any other way to do it, considering the CDATA is returning significant whitespace.
I think the best way is...
XmlCDataSection cDataNode = (XmlCDataSection)(doc.SelectSingleNode("section/description").ChildNodes[0]);
string finalData = cDataNode.Data;
Actually i think is pretty much simple. the CDATA section it will be loaded in the XmlDocument like another XmlNode the difference is that this node is going to has the property NodeType = CDATA, wich it mean if you have the XmlNode node = doc.SelectSingleNode("section/description"); that node will have a ChildNode with the InnerText property filled the pure data, and there is you want to remove the especial characters just use Trim() and you will have the data.
The code will look like
XmlNode cDataNode = doc.SelectSingleNode("section/description").ChildNodes[0];
string finalData = cDataNode.InnerText.Trim();
Thanks
XOnDaRocks
A simpler form of #Franky's solution:
doc.SelectSingleNode("section/description").FirstChild.Value
The Value property is equivalent to the Data property of the casted XmlCDataSection type.
CDATA blocks are effectively verbatim. Any whitespace inside CDATA is significant, by definition, according to XML spec. Therefore, you get that whitespace when you retrieve the node value. If you want to strip it using your own rules (since XML spec doesn't specify any standard way of stripping whitespace in CDATA), you have to do it yourself, using String.Replace, Regex.Replace etc as needed.

How do I edit XML in C# without changing format/spacing?

I need an application that goes through an xml file, changes some attribute values and adds other attributes. I know I can do this with XmlDocument and XmlWriter. However, I don't want to change the spacing of the document. Is there any way to do this? Or, will I have to parse the file myself?
XmlDocument has a property PreserveWhitespace. If you set this to true insignificant whitespace will be preserved.
See MSDN
EDIT
If I execute the following code, whitespace including line breaks is preserved. (It's true that a space is inserted between <b and />)
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.LoadXml(
#"<a>
<b/>
</a>");
Console.WriteLine(doc.InnerXml);
The output is:
<a>
<b />
</a>
Insignificant whitespace will typically be thrown away or reformatted. So unless the XML file uses the xml:space="preserve" attribute on the nodes which shall preserve their exact whitespace, changing whitespace is OK per XML specifications.

Change the node names in an XML file using C#

I have a huge bunch of XML files with the following structure:
<Stuff1>
<Content>someContent</name>
<type>someType</type>
</Stuff1>
<Stuff2>
<Content>someContent</name>
<type>someType</type>
</Stuff2>
<Stuff3>
<Content>someContent</name>
<type>someType</type>
</Stuff3>
...
...
I need to change the each of the "Content" node names to StuffxContent; basically prepend the parent node name to the content node's name.
I planned to use the XMLDocument class and figure out a way, but thought I would ask if there were any better ways to do this.
(1.) The [XmlElement / XmlNode].Name property is read-only.
(2.) The XML structure used in the question is crude and could be improved.
(3.) Regardless, here is a code solution to the given question:
String sampleXml =
"<doc>"+
"<Stuff1>"+
"<Content>someContent</Content>"+
"<type>someType</type>"+
"</Stuff1>"+
"<Stuff2>"+
"<Content>someContent</Content>"+
"<type>someType</type>"+
"</Stuff2>"+
"<Stuff3>"+
"<Content>someContent</Content>"+
"<type>someType</type>"+
"</Stuff3>"+
"</doc>";
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(sampleXml);
XmlNodeList stuffNodeList = xmlDoc.SelectNodes("//*[starts-with(name(), 'Stuff')]");
foreach (XmlNode stuffNode in stuffNodeList)
{
// get existing 'Content' node
XmlNode contentNode = stuffNode.SelectSingleNode("Content");
// create new (renamed) Content node
XmlNode newNode = xmlDoc.CreateElement(contentNode.Name + stuffNode.Name);
// [if needed] copy existing Content children
//newNode.InnerXml = stuffNode.InnerXml;
// replace existing Content node with newly renamed Content node
stuffNode.InsertBefore(newNode, contentNode);
stuffNode.RemoveChild(contentNode);
}
//xmlDoc.Save
PS: I came here looking for a nicer way of renaming a node/element; I'm still looking.
I used this method to rename the node:
/// <summary>
/// Rename Node
/// </summary>
/// <param name="parentnode"></param>
/// <param name="oldname"></param>
/// <param name="newname"></param>
private static void RenameNode(XmlNode parentnode, string oldChildName, string newChildName)
{
var newnode = parentnode.OwnerDocument.CreateNode(XmlNodeType.Element, newChildName, "");
var oldNode = parentnode.SelectSingleNode(oldChildName);
foreach (XmlAttribute att in oldNode.Attributes)
newnode.Attributes.Append(att);
foreach (XmlNode child in oldNode.ChildNodes)
newnode.AppendChild(child);
parentnode.ReplaceChild(newnode, oldNode);
}
The easiest way I found to rename a node is:
xmlNode.InnerXmL = newNode.InnerXml.Replace("OldName>", "NewName>")
Don't include the opening < to ensure that the closing </OldName> tag is renamed as well.
Perhaps a better solution would be to iterate through each node, and write the information out to a new document. Obviously, this will depend on how you will be using the data in future, but I'd recommend the same reformatting as FlySwat suggested...
<stuff id="1">
<content/>
</stuff>
I'd also suggest that using the XDocument that was recently added would be the best way to go about creating the new document.
I'll answer the higher question: why are you trying this using XmlDocument?
I Think the best way to accomplish what you aim is a simple XSLT file
that match the "CONTENTSTUFF" node and output a "CONTENT" node...
don't see a reason to get such heavy guns...
Either way, If you still wish to do it C# Style,
Use XmlReader + XmlWriter and not XmlDocument for memory and speed purposes.
XmlDocument store the entire XML in memory, and makes it very heavy for Traversing once...
XmlDocument is good if you access the element many times (not the situation here).
I am not an expert in XML, and in my case I just needed to make all tag names in a HTML file to upper case, for further manipulation in XmlDocument with GetElementsByTagName. The reason I needed upper case was that for XmlDocument the tag names are case sensitive (since it is XML), and I could not guarantee that my HTML-file had consistent case in the tag names.
So I solved it like this: I used XDocument as an intermediate step, where you can rename elements (i.e. the tag name), and then loaded that into a XmlDocument. Here is my VB.NET-code (the C#-coding will be very similar).
Dim x As XDocument = XDocument.Load("myFile.html")
For Each element In x.Descendants()
element.Name = element.Name.LocalName.ToUpper()
Next
Dim x2 As XmlDocument = New XmlDocument()
x2.LoadXml(x.ToString())
For my purpose it worked fine, though I understand that in certain cases this might not be a solution if you are dealing with a pure XML-file.
Load it in as a string and do a replace on the whole lot..
String sampleXml =
"<doc>"+
"<Stuff1>"+
"<Content>someContent</Content>"+
"<type>someType</type>"+
"</Stuff1>"+
"<Stuff2>"+
"<Content>someContent</Content>"+
"<type>someType</type>"+
"</Stuff2>"+
"<Stuff3>"+
"<Content>someContent</Content>"+
"<type>someType</type>"+
"</Stuff3>"+
"</doc>";
sampleXml = sampleXml.Replace("Content","StuffxContent")
The XML you have provided shows that someone completely misses the point of XML.
Instead of having
<stuff1>
<content/>
</stuff1>
You should have:/
<stuff id="1">
<content/>
</stuff>
Now you would be able to traverse the document using Xpath (ie, //stuff[id='1']/content/) The names of nodes should not be used to establish identity, you use attributes for that.
To do what you asked, load the XML into an xml document, and simply iterate through the first level of child nodes renaming them.
PseudoCode:
foreach (XmlNode n in YourDoc.ChildNodes)
{
n.ChildNode[0].Name = n.Name + n.ChildNode[0].Name;
}
YourDoc.Save();
However, I'd strongly recommend you actually fix the XML so that it is useful, instead of wreck it further.

Categories

Resources