How do I edit XML in C# without changing format/spacing? - c#

I need an application that goes through an xml file, changes some attribute values and adds other attributes. I know I can do this with XmlDocument and XmlWriter. However, I don't want to change the spacing of the document. Is there any way to do this? Or, will I have to parse the file myself?

XmlDocument has a property PreserveWhitespace. If you set this to true insignificant whitespace will be preserved.
See MSDN
EDIT
If I execute the following code, whitespace including line breaks is preserved. (It's true that a space is inserted between <b and />)
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.LoadXml(
@"<a>
<b/>
</a>");
Console.WriteLine(doc.InnerXml);
The output is:
<a>
<b />
</a>
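The same setting also works when attributes are actually changed, which is what the question asks for. The following is a minimal sketch, not a definitive implementation; the class name WhitespacePreservingEdit and the method SetAttribute are made up for illustration. Whitespace between nodes is kept, but formatting inside a tag (quote style, spacing around = and before />) is still normalized by the writer.

```csharp
using System.Linq;
using System.Xml;

// Hypothetical helper, named for this example only.
public static class WhitespacePreservingEdit
{
    // Loads xml with insignificant whitespace kept, sets an attribute on every
    // element called elementName, and returns the serialized document.
    public static string SetAttribute(string xml, string elementName, string name, string value)
    {
        // PreserveWhitespace must be set before the document is loaded.
        var doc = new XmlDocument { PreserveWhitespace = true };
        doc.LoadXml(xml);

        // Copy the matches to a list first so the loop does not depend on a live node list.
        var elements = doc.GetElementsByTagName(elementName).Cast<XmlElement>().ToList();
        foreach (XmlElement element in elements)
        {
            element.SetAttribute(name, value); // adds the attribute or overwrites its value
        }
        return doc.OuterXml;
    }
}
```

Calling WhitespacePreservingEdit.SetAttribute(xml, "b", "id", "1") on the document above keeps the line breaks and indentation around <b>, and only the tag itself is rewritten.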

Insignificant whitespace will typically be thrown away or reformatted. So unless the XML file uses the xml:space="preserve" attribute on the nodes which shall preserve their exact whitespace, changing whitespace is OK per XML specifications.

Related

C# junk characters break XElement "pretty" representation

I have occasionally run across XML with some junk characters tossed in between the elements, which appears to be confusing whatever internal XNode/XElement method handles prettifying the Element.
The following...
var badNode = XElement.Parse(@"<b>+
<inner1/>
<inner2/>
</b>");
prints out
<b>+
<inner1 /><inner2 /></b>
while this...
var badNode = XElement.Parse(@"<b>
<inner1/>
<inner2/>
</b>");
gives the expected
<b>
<inner1 />
<inner2 />
</b>
According to the debugger, the junk character gets parsed in as the XElement's "NextNode" property, which then apparently assigns the remaining XML as its "NextNode", causing the single line prettifying.
Is there any way to prevent/ignore this behavior, short of pre-screening the XML for any errant characters in between tag markers?
You are getting awkward indentation for badNode because, by adding the non-whitespace + character to the <b> element value, you have made the element contain mixed content, which is defined by the W3C as follows:
3.2.2 Mixed Content
[Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.]
The presence of mixed content inside an element triggers special formatting rules for XmlWriter (which is used internally by XElement.ToString() to actually write itself to an XML string) that are explained in the documentation remarks for XmlWriterSettings.Indent:
This property only applies to XmlWriter instances that output text content; otherwise, this setting is ignored.
The elements are indented as long as the element does not contain mixed content. Once the WriteString or WriteWhitespace method is called to write out a mixed element content, the XmlWriter stops indenting. The indenting resumes once the mixed content element is closed.
This explains the behavior you are seeing.
As a workaround, parsing your XML with LoadOptions.PreserveWhitespace, which preserves insignificant white space while parsing, might be what you want:
var badNode = XElement.Parse(@"<b>+
<inner1/>
<inner2/>
</b>",
LoadOptions.PreserveWhitespace);
Console.WriteLine(badNode);
Which outputs:
<b>+
<inner1 />
<inner2 />
</b>
Demo fiddle #1 here.
Alternatively, if you are sure that badNode should not have character data, you could strip it manually after parsing:
badNode.Nodes().OfType<XText>().Remove();
Now badNode will no longer contain mixed content and XmlWriter will indent it nicely.
Demo fiddle #2 here.

Read a space character from an XML element

A surprisingly simple question this time! :-) There's an XML file like this:
<xml>
<data> </data>
</xml>
Now I need to read exactly whatever is in the <data> element. Be it a single whitespace like U+0020. My naive guess:
XmlDocument xd = new XmlDocument();
xd.Load(fileName);
XmlNode xn = xd.DocumentElement.SelectSingleNode("data");
string data = xn.InnerText;
But that returns an empty string. The white space got lost. Any other data can be read just fine.
What do I need to do to get my space character here?
After browsing the web for a while, I tried reading the XML file with an XmlReader that lets me set XmlReaderSettings.IgnoreWhitespace = false but that didn't help.
You must use xml:space="preserve" in your XML, according to the W3C standards and the MSDN docs.
The W3C standards dictate that white space be handled differently
depending on where in the document it occurs, and depending on the
setting of the xml:space attribute. If the characters occur within the
mixed element content or inside the scope of the xml:space="preserve",
they must be preserved and passed without modification to the
application. Any other white space does not need to be preserved. The
XmlTextReader only preserves white space that occurs within an
xml:space="preserve" context.
XmlDocument xd = new XmlDocument();
xd.LoadXml(@"<xml xml:space=""preserve""><data> </data></xml>");
XmlNode xn = xd.DocumentElement.SelectSingleNode("data");
string data = xn.InnerText; // data == " "
Console.WriteLine(data == " "); //True
Tested HERE.
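If the XML file itself cannot be changed to carry xml:space="preserve", asking the parser to keep whitespace is another option. The following is a sketch under that assumption; the class name SpaceReader and its method names are invented for this example. It relies on XmlDocument.PreserveWhitespace and LoadOptions.PreserveWhitespace, both of which appear elsewhere on this page.

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, named for this example only.
public static class SpaceReader
{
    // XmlDocument variant: whitespace-only text is kept as a node, so InnerText returns it.
    public static string WithXmlDocument(string xml)
    {
        var doc = new XmlDocument { PreserveWhitespace = true }; // set before loading
        doc.LoadXml(xml);
        return doc.DocumentElement.SelectSingleNode("data").InnerText;
    }

    // LINQ to XML variant: the same idea via LoadOptions.PreserveWhitespace.
    public static string WithXElement(string xml)
    {
        return XElement.Parse(xml, LoadOptions.PreserveWhitespace).Element("data").Value;
    }
}
```

For the file in the question, both methods should return the single space inside <data>.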

How to have special characters in XML File and read it correctly?

I am creating an XML file where a tag's InnerText is \r\n
It is created fine. But when I read these XML files in code-behind, \r\n gets read as \n\n
I am reading it through XmlTextReader.
What could be the reason for this and how to read it like it was provided?
You need to encode such characters in order for them to be preserved.
One approach is to replace the characters with the respective character reference:
\n to &#10;
\r to &#13;
Another one would be to enclose sections that need to preserve whitespace in <![CDATA[]]> sections.
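The character-reference approach can be sketched as follows. This is an illustration rather than the asker's actual code: the class name NewlineRoundTrip and the element name tag are assumptions. On the writing side it uses NewLineHandling.Entitize, which makes XmlWriter emit a carriage return as a character reference so that a normalizing reader gives it back unchanged.

```csharp
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, named for this example only.
public static class NewlineRoundTrip
{
    // Serializes text inside a <tag> element, entitizing carriage returns.
    public static string Write(string text)
    {
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            NewLineHandling = NewLineHandling.Entitize
        };
        var sb = new StringBuilder();
        using (var writer = XmlWriter.Create(sb, settings))
        {
            new XElement("tag", text).WriteTo(writer);
        }
        return sb.ToString();
    }

    // Reads the element text back; literal \r\n in the markup is normalized to \n by the parser.
    public static string Read(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

With this, NewlineRoundTrip.Read(NewlineRoundTrip.Write("line1\r\nline2")) should give back the original \r\n, whereas a literal \r\n typed directly into the markup comes back as \n.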
The question doesn't provide enough details to correctly answer it. So I will be assuming things here.
A few things should be considered:
Are you making use of different Locale settings?
What is the encoding of the XML file?
What is the platform encoding of the system?
What are the system's line separator(s)?
Are you using the innertext to force a line break in a document that is built up from this XML file?
Are you aware of the CDATA tag in the XML language?
These points can be of influence to your problem. But when re-reading your question, I think it is the CDATA solution you are looking for to keep your innertext from being misinterpreted by an XML parser.
My guess is to wrap the \r\n innertext in a CDATA tag.
More info on a CDATA tag can be found here: CDATA
Good luck!

xml editing with exact formatting

I need to add an attribute in some specific cases to a large XML file. But I must preserve the exact formatting, even if there is some inconsistency in it
e.g.
name="one"/>
name = "one" />
or (length of indentation):
<newtag .../>
<newtag ... />
<netag .../>
Does anyone have an idea how to do this? Setting PreserveWhitespace on XmlDocument causes reading some XmlNode to fail (because there is whitespace instead of the < opening tag) and a NullReferenceException is thrown.
There is no standard off-the-shelf parser for XML that I am aware of that will preserve details like the whitespace around the "=" sign between attributes.
I don't think this requirement is achievable except by writing your own XML parser (or modifying an existing one).
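Short of writing a full parser, one possible middle ground is to let XmlReader locate the elements and then splice the new attribute into the raw text, so that every other character stays untouched. The following is only a sketch under stated assumptions: the names RawXmlEditor and AddAttribute are invented, and it assumes the reader's line information for an element points at the first character of the element name (the code checks this and throws if it does not hold).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, named for this example only.
public static class RawXmlEditor
{
    // Inserts attrText (for example flag="1") right after the name of every element
    // called elementName, leaving all other characters of xml unchanged.
    public static string AddAttribute(string xml, string elementName, string attrText)
    {
        // Offsets at which each line starts; \r\n, \n and a lone \r each end a line.
        var lineStarts = new List<int> { 0 };
        for (int i = 0; i < xml.Length; i++)
        {
            bool loneCr = xml[i] == '\r' && (i + 1 >= xml.Length || xml[i + 1] != '\n');
            if (xml[i] == '\n' || loneCr) lineStarts.Add(i + 1);
        }

        var insertAt = new List<int>();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            var info = reader as IXmlLineInfo;
            if (info == null || !info.HasLineInfo())
                throw new InvalidOperationException("Reader provides no line information.");

            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.Name != elementName) continue;

                // Assumption: LineNumber/LinePosition (1-based) point at the element name.
                int line = info.LineNumber - 1;
                int nameStart = line >= 0 && line < lineStarts.Count
                    ? lineStarts[line] + info.LinePosition - 1
                    : -1;
                bool inRange = nameStart >= 0 && nameStart + reader.Name.Length <= xml.Length;
                if (!inRange || string.CompareOrdinal(xml, nameStart, reader.Name, 0, reader.Name.Length) != 0)
                    throw new InvalidOperationException("Line information does not match the raw text.");

                insertAt.Add(nameStart + reader.Name.Length);
            }
        }

        // Apply insertions from the end so that earlier offsets stay valid.
        var result = new StringBuilder(xml);
        for (int i = insertAt.Count - 1; i >= 0; i--)
        {
            result.Insert(insertAt[i], " " + attrText);
        }
        return result.ToString();
    }
}
```

The design choice here is that the parser is used only to find positions; the output is produced by editing the original string, so inconsistent spacing such as name = "one" /> survives as written. Filtering for "specific cases" would go where the element name is compared.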

How do I stop XElement.Save from escaping characters?

I'm populating an XElement with information and writing it to an xml file using the XElement.Save(path) method. At some point, certain characters in the resulting file are being escaped - for example, > becomes &gt;.
This behaviour is unacceptable, since I need to store information in the XML that includes the > character as part of a password. How can I write the 'raw' content of my XElement object to XML without having these escaped?
Lack of this behavior is unacceptable.
A standalone unescaped > is invalid XML.
XElement is designed to produce valid XML.
If you want to get the unescaped content of the element, use the Value property.
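A short sketch of why the escaping is harmless: the escaped form exists only in the serialized text, and reading the element back through Value returns the original characters. The class name EscapingDemo and the element name password are assumptions made for this example.

```csharp
using System.Xml.Linq;

// Hypothetical helper, named for this example only.
public static class EscapingDemo
{
    // Serializes the secret as element text; markup characters are escaped in the output.
    public static string Serialize(string secret)
    {
        return new XElement("password", secret).ToString();
    }

    // Parses the XML and returns the element text with all escapes resolved.
    public static string ReadBack(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

For example, EscapingDemo.ReadBack(EscapingDemo.Serialize("p>ss<w&rd")) gives back p>ss<w&rd, even though the serialized text contains &gt;, &lt; and &amp;.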
The XML specification usually allows > to appear unescaped. XDocument plays it safe and escapes it although it appears in places where the escaping is not strictly required.
You can do a replace on the generated XML. Be aware per http://www.w3.org/TR/REC-xml#syntax, if this results in any ]]> sequences, the XML will not conform to the XML specification. Moreover, XDocument.Parse will actually reject such XML with the error "']]>' is not allowed in character data.".
XDocument doc = XDocument.Parse("<test>Test&gt;Data</test>");
// Don't use this if it could result in any ]]> sequences!
string s = doc.ToString().Replace("&gt;", ">");
System.IO.File.WriteAllText(@"c:\path\test.xml", s);
Considering that any spec-compliant XML parser must support &gt;, I'd highly recommend fixing the code that is processing the XML output of your program.
