C# junk characters break XElement "pretty" representation - c#

I have occasionally run across XML with some junk characters tossed in between the elements, which appears to be confusing whatever internal XNode/XElement method handles prettifying the Element.
The following...
var badNode = XElement.Parse(@"<b>+
<inner1/>
<inner2/>
</b>");
prints out
<b>+
<inner1 /><inner2 /></b>
while this...
var badNode = XElement.Parse(@"<b>
<inner1/>
<inner2/>
</b>");
gives the expected
<b>
  <inner1 />
  <inner2 />
</b>
According to the debugger, the junk character gets parsed in as the XElement's "NextNode" property, which then apparently assigns the remaining XML as its "NextNode", causing the single line prettifying.
Is there any way to prevent/ignore this behavior, short of pre-screening the XML for any errant characters in between tag markers?

You are getting awkward indentation for badNode because, by adding the non-whitespace + character to the <b> element value, you have given the element mixed content, which the W3C defines as follows:
3.2.2 Mixed Content
[Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.]
The presence of mixed content inside an element triggers special formatting rules for XmlWriter (which is used internally by XElement.ToString() to actually write itself to an XML string) that are explained in the documentation remarks for XmlWriterSettings.Indent:
This property only applies to XmlWriter instances that output text content; otherwise, this setting is ignored.
The elements are indented as long as the element does not contain mixed content. Once the WriteString or WriteWhitespace method is called to write out a mixed element content, the XmlWriter stops indenting. The indenting resumes once the mixed content element is closed.
This explains the behavior you are seeing.
As a workaround, parsing your XML with LoadOptions.PreserveWhitespace, which preserves insignificant white space while parsing, might be what you want:
var badNode = XElement.Parse(@"<b>+
<inner1/>
<inner2/>
</b>",
LoadOptions.PreserveWhitespace);
Console.WriteLine(badNode);
Which outputs:
<b>+
<inner1 />
<inner2 />
</b>
Demo fiddle #1 here.
Alternatively, if you are sure that badNode should not have character data, you could strip it manually after parsing:
badNode.Nodes().OfType<XText>().Remove();
Now badNode will no longer contain mixed content and XmlWriter will indent it nicely.
Demo fiddle #2 here.
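If only some elements are affected, a narrower variant of the stripping approach may be safer. The following is a minimal sketch, not part of the original answer: StripStrayText is a hypothetical helper name, and it assumes that any text sitting next to child elements is junk, while text in leaf elements is real data and must be kept. Note that XCData derives from XText, so CDATA sections next to child elements would be removed as well.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class XmlCleanup
{
    // Hypothetical helper: removes text nodes only from elements that
    // also contain child elements, so leaf values such as
    // <name>foo</name> are left untouched.
    public static void StripStrayText(XElement root)
    {
        var stray = root.DescendantsAndSelf()
            .Where(e => e.HasElements)
            .SelectMany(e => e.Nodes().OfType<XText>())
            .ToList(); // materialize before mutating the tree

        foreach (var text in stray)
        {
            text.Remove();
        }
    }
}
```

After calling XmlCleanup.StripStrayText(badNode), the element no longer has mixed content, so Console.WriteLine(badNode) should indent it as usual.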

Related

Cleanup xml file - Invalid character in the given encoding

I am integrating against Magento ecommerce using their "SOAP" api, and the API returns "XML" results. Problem is, this is not always well formed:
<product>
<entity_id>18</entity_id>
<price regular="2925  <span>Nok</span>"/>
...
In this specific case, the price regular attribute has both an invisible character 0xa0 (before the span tag), and < > within the attribute text.
I have no way to get proper well-formed XML from Magento it seems, so the alternative is to clean it up before I feed it to my XmlSerializer deserialization:
XmlSerializer serializer = new XmlSerializer(typeof(Responses.Product.product));
product = serializer.Deserialize(textReader) as Responses.Product.product;
The invisible character I can get rid of using a simple text replace, but I'm more unsure about the <> within the attribute text.
My question is, how to clean it up for be valid XML?
The character 0x3c is the < character. For an invisible character you would rather be looking for something like the 0x09 TAB character.
To fix the broken markup you could look for that specific HTML tag in the content, using a regular expression to allow any currency within the tag:
xml = Regex.Replace(xml, "<span>([A-Za-z]{3})</span>", "&lt;span&gt;$1&lt;/span&gt;");
This works as long as there aren't any span elements with three-character content in the XML code itself. You could do similar replacements for other HTML tags, but try to keep the pattern as specific as possible, to avoid false positives.

Having trouble taking out all the newline, tab, and carriage return between two tags

I have been working on this for almost a day now. But I'm not able to take out all the newline, tab, and carriage return from ">" and "<"
This is a sample XML file I'm reading:
<Consequence_Note>
<Text>In some cases, integer coercion errors can lead to exploitable buffer
overflow conditions, resulting in the execution of arbitrary
code.</Text>
</Consequence_Note>
and this
<Consequence_Scope>Availability</Consequence_Scope>
<Consequence_Technical_Impact>DoS: resource consumption
(CPU)</Consequence_Technical_Impact>
My goal is to take out all the newlines, tabs, and carriage returns between these two characters (> and <). The only thing I'm able to achieve is to take out all the \n\t\r between ">" and "<" when there's nothing else in between the two tags. But I'm not able to take out all the \n\t\r when there are other characters in between the two tags.
I need help with a regular expression that will take out all the newlines, tabs, and carriage returns between ">" and "<".
For example:
<Consequence_Technical_Impact>DoS: resource consumption
(CPU)</Consequence_Technical_Impact>
What I would like to have is:
<Consequence_Technical_Impact>DoS: resource consumption (CPU)</Consequence_Technical_Impact>
This is my code (I'm reading from a xml file):
String file = @"C:\Documents and Settings\YYC\Desktop\cwec_v2.1\cwec_v2.1.xml";
var lines = File.ReadAllText(file);
var replace = Regex.Replace(lines, @">([\r\n\t])*?<", "><");
File.WriteAllText(file, replace);
Don't parse HTML/XML with regular expressions (see "RegEx match open tags except XHTML self-contained tags")!
Use an XML reader for XML, or HtmlAgilityPack (or some other HTML tool) for HTML.
XML/HTML documents are complex enough that a regex will not always do the job right (it works in some cases, but not in general).
If you first read the document using an XmlReader, it will remove the newlines from the input by default. Then you can simply write it back out with the correct writer settings.
See: http://msdn.microsoft.com/en-us/library/system.xml.xmlreadersettings.ignorewhitespace.aspx
See: http://msdn.microsoft.com/en-us/library/system.xml.xmlwritersettings.newlinehandling.aspx
A regex alternative can probably be built, but it will still have lots and lots of issues with XML containing CDATA, comments and other constructs which make XML hard to parse to begin with. If your XML is very structured, machine generated and unchanging, you could create a regex to fix it, but on the other hand, you might also be able to fix the generator. Simplest regex that might work:
\s{2,}
replace with
[ ]
That strips out any whitespace which is longer than one character and replaces it with one space. No need to treat any other whitespace inside tags differently, that's what the XMLReader should do by default anyways.
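For comparison, the same collapsing can be applied to the parsed tree instead of the raw file, so that tags, comments and attributes are never touched by the regex. This is a rough sketch under a few assumptions: CollapseTextWhitespace is a made-up helper name, it uses LINQ to XML rather than a raw XmlReader, it replaces every whitespace run (\s+) with one space, and it deliberately leaves CDATA sections alone.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class WhitespaceNormalizer
{
    // Hypothetical helper: parses the XML, then collapses each run of
    // whitespace inside text nodes to a single space.
    public static XDocument CollapseTextWhitespace(string xml)
    {
        // Without LoadOptions.PreserveWhitespace, whitespace-only text
        // between elements is dropped while parsing.
        var doc = XDocument.Parse(xml);

        var textNodes = doc.DescendantNodes()
            .OfType<XText>()
            .Where(t => !(t is XCData)) // keep CDATA content untouched
            .ToList();

        foreach (var text in textNodes)
        {
            text.Value = Regex.Replace(text.Value, @"\s+", " ");
        }

        return doc;
    }
}
```

Calling doc.Save(path) on the result writes the file back out with the writer's default indentation.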

How to have the special entities like &#10; in the xml document output while using LinqXml

I am trying to generate an XML document using LinqXml in which "\n" appears as "&#10;" in the XElement value. No matter what settings I try with the XmlWriter, it still emits a "\n" in the final XML document.
I tried the following: I extended the XmlWriter.
I overrode the XmlWriterSettings and changed the NewLineHandling.
Neither option worked for me.
Any help/pointers will be appreciated.
Regards
Stephen
LINQ to XML works on top of XmlReader/XmlWriter. The XmlReader is an implementation of the XML processor/parser as described in the XML spec. That spec basically says that the parser needs to hide the actual representation in the text from the application above. Meaning that both \n and &#xA; should be reported as the same thing. That's what it does.
XmlWriter is the same thing backwards. Its purpose is to save the input in such a way that, when parsed, you will get exactly the same thing back.
So writing a text value "\n" will write it such that the parser will report back "\n" (in this case the output text is \n for a text node, but &#xA; for an attribute, due to the normalization which occurs in attribute values).
Following that idea, trying to write a text value "&#xA;" will actually write out "&amp;#xA;", because when the reader parses that it will get back the original "&#xA;".
LINQ to XML uses XmlWriter to save the tree to an XML file. So you will get the above behavior.
You could write the tree into the XmlWriter yourself (or part of it), in which case you get more control. In particular, it will allow you to use the XmlWriter.WriteCharEntity method, which forces the writer to output the specified character as a character entity, that is in the &#xXX; format. (Note that it will use the hex format, not the decimal.)
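A minimal sketch of that manual approach, assuming a single Root element with two text fragments (the class name, element name and sample text are illustrative only):

```csharp
using System.IO;
using System.Xml;

static class CharEntityDemo
{
    // Writes <Root> by hand so that the newline between the two text
    // fragments is emitted as a character entity, not a raw '\n'.
    public static string Write()
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (var writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("Root");
            writer.WriteString("Hello,");
            writer.WriteCharEntity('\n'); // written in hex form
            writer.WriteString("World!");
            writer.WriteEndElement();
        }

        return output.ToString();
    }
}
```

The catch is that you have to walk the tree and decide yourself where an entity should be written. XElement.Save and XElement.ToString will not do this for you.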
What is the reason for having the escaped value for '\n' in the XML element? The newline character is valid inside an XML element and when you parse the XML again, it will be parsed as you expect.
What you're looking for would happen if the newline character is placed within the value of an XML attribute:
XElement xEl = new XElement("Root",
new XAttribute("Value",
"Hello," + Environment.NewLine + "World!"));
Console.WriteLine(xEl);
Output:
<Root Value="Hello,&#xD;&#xA;World!" />

How do I stop XElement.Save from escaping characters?

I'm populating an XElement with information and writing it to an xml file using the XElement.Save(path) method. At some point, certain characters in the resulting file are being escaped - for example, > becomes &gt;.
This behaviour is unacceptable, since I need to store information in the XML that includes the > character as part of a password. How can I write the 'raw' content of my XElement object to XML without having these escaped?
Lack of this behavior is unacceptable.
A standalone unescaped > is invalid XML.
XElement is designed to produce valid XML.
If you want to get the unescaped content of the element, use the Value property.
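To illustrate, here is a small sketch (the password element name and the sample value are made up): the escaping exists only in the serialized text, and Value gives the original string back after a round trip.

```csharp
using System.Xml.Linq;

static class EscapeRoundTrip
{
    // Serializes an element whose text contains '>' and parses it back.
    // Returns the value seen by the consumer after the round trip.
    public static string RoundTrip(string secret)
    {
        var element = new XElement("password", secret);

        // The serialized form contains the escaped character...
        string serialized = element.ToString();

        // ...but any XML parser unescapes it again when reading.
        return XElement.Parse(serialized).Value;
    }
}
```

So a password such as "p>ss" is stored as p&gt;ss in the file, and reads back as "p>ss" through Value.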
The XML specification usually allows > to appear unescaped. XDocument plays it safe and escapes it although it appears in places where the escaping is not strictly required.
You can do a replace on the generated XML. Be aware per http://www.w3.org/TR/REC-xml#syntax, if this results in any ]]> sequences, the XML will not conform to the XML specification. Moreover, XDocument.Parse will actually reject such XML with the error "']]>' is not allowed in character data.".
XDocument doc = XDocument.Parse("<test>Test&gt;Data</test>");
// Don't use this if it could result in any ]]> sequences!
string s = doc.ToString().Replace("&gt;", ">");
System.IO.File.WriteAllText(@"c:\path\test.xml", s);
Considering that any spec-compliant XML parser must support &gt;, I'd highly recommend fixing the code that is processing the XML output of your program.

How do I edit XML in C# without changing format/spacing?

I need an application that goes through an xml file, changes some attribute values and adds other attributes. I know I can do this with XmlDocument and XmlWriter. However, I don't want to change the spacing of the document. Is there any way to do this? Or, will I have to parse the file myself?
XmlDocument has a property PreserveWhitespace. If you set this to true insignificant whitespace will be preserved.
See MSDN
EDIT
If I execute the following code, whitespace including line breaks is preserved. (It's true that a space is inserted between <b and />)
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.LoadXml(
@"<a>
<b/>
</a>");
Console.WriteLine(doc.InnerXml);
The output is:
<a>
<b />
</a>
Insignificant whitespace will typically be thrown away or reformatted. So unless the XML file uses the xml:space="preserve" attribute on the nodes which shall preserve their exact whitespace, changing whitespace is OK per XML specifications.
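If you prefer LINQ to XML over XmlDocument, roughly the same effect can be had with LoadOptions.PreserveWhitespace on load and SaveOptions.DisableFormatting on output. A sketch, with a hypothetical helper and a made-up id attribute standing in for the real edits:

```csharp
using System.Xml.Linq;

static class PreserveFormattingDemo
{
    // Hypothetical helper: adds id="1" to every <b> element while
    // keeping the whitespace between elements as it was in the input.
    public static string AddAttribute(string xml)
    {
        // Keep whitespace-only text nodes instead of discarding them.
        var doc = XDocument.Parse(xml, LoadOptions.PreserveWhitespace);

        foreach (var b in doc.Descendants("b"))
        {
            b.SetAttributeValue("id", "1");
        }

        // Do not let the writer add its own indentation on output.
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

For files, the corresponding calls are XDocument.Load(path, LoadOptions.PreserveWhitespace) and doc.Save(path, SaveOptions.DisableFormatting). Spacing inside tags, such as the blank before />, is still normalized by the writer.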
