Problem getting XML properly formatted - c#

I use classes (autogenerated from a schema) to generate XML documents. This has worked fine until now, when I need to use inline HTML elements. I've tried several different methods, but as soon as I use inline HTML, the "<" and ">" get replaced with &lt;, &gt;, etc.
Example:
<meta>
<name>test</name>
<value>test <br />new row</value>
</meta>
This gets "destroyed" later on when I try to get it as a string for database storage; the value is changed to:
<value>test &lt;br /&gt;new row</value>
How is it possible to keep the angle brackets intact?

You need to use CDATA sections for XML (or XML like) content.

The XML writer is escaping the reserved characters such as < and >. If you read the text back using an XML reader, your &lt; will be correctly read as <.
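To make the two answers above concrete, here is a minimal sketch that writes the same inline HTML once as escaped text and once as a CDATA section, then reads both back. The helper name WriteValue and the use of a StringWriter are assumptions for illustration; the asker's generated classes are not shown in the question.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: writes <value>...</value> as escaped text or as CDATA.
static string WriteValue(string content, bool useCData)
{
    var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
    var sw = new StringWriter();
    using (var writer = XmlWriter.Create(sw, settings))
    {
        writer.WriteStartElement("value");
        if (useCData)
            writer.WriteCData(content);   // wrapped in <![CDATA[ ... ]]>
        else
            writer.WriteString(content);  // reserved characters are escaped
        writer.WriteEndElement();
    }
    return sw.ToString();
}

string html = "test <br />new row";
string escaped = WriteValue(html, useCData: false);
string cdata = WriteValue(html, useCData: true);

Console.WriteLine(escaped);
Console.WriteLine(cdata);

// Either form gives back the original text when parsed as XML.
string fromEscaped = XElement.Parse(escaped).Value;
string fromCData = XElement.Parse(cdata).Value;
Console.WriteLine(fromEscaped == html && fromCData == html);
```

Both strings carry the same content; the difference is only in how the stored string looks before it is parsed again.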

Related

How to write '&' in xml?

I am using xmlTextWriter to create the xml.
writer.WriteStartElement("book");
writer.WriteAttributeString("author", "j.k.rowling");
writer.WriteAttributeString("year", "1990");
writer.WriteString("&");
writer.WriteEndElement();
But now I need to write '&', and xmlTextWriter automatically writes it as "&amp;".
Is there any workaround for this?
I am creating the XML by reading a doc file. So if I read "-", then in the XML I need to write "&ndash;". But when writing, it comes out as "&amp;ndash;".
So, for example, if I am trying to write a node with the text good-bad, I actually need to write my XML as <node>good&ndash;bad</node>. This is a requirement of my project.
In a proper XML file, you cannot have a standalone & character unless it starts an escape sequence. So if you need an XML node to contain the text good&ndash;bad, it will have to be encoded as good&amp;ndash;bad. There is no workaround, as anything different would not be valid XML. The only way to make it work is to write the file as plain text exactly how you want it, but then it could not be read by an XML parser, as it is not proper XML.
Here's a code example of my suggested workaround (you didn't specify a language, so I am showing you in C#, but Java should have something similar):
using(var sw = new StreamWriter(stream))
{
// other code to write XML-like data
sw.WriteLine("<node>good&ndash;bad</node>");
// other code to write XML-like data
}
As you discovered, another option is the WriteRaw() method on XmlTextWriter (in C#), which writes an unencoded string, but that does not change the fact that the result is not going to be a valid XML file.
But as I mentioned, if you tried to read this with an XML parser, it would fail, because &ndash; is not a valid XML character entity, so it is not valid XML.
&ndash; is an HTML character entity, so escaping it in XML should not normally be necessary.
In the XML language, & is the escape character, so &amp; is the appropriate string representation of &. You cannot use just a & character, because it has a special meaning and a single & would therefore be misinterpreted by the parser.
You will see similar behavior with the <, >, ", and ' characters. All have meaning within the XML language, so they have to be escaped if you need to represent them as text in a document.
Here's a reference to all of the character entities in XML (and HTML) from Wikipedia. Each will always be represented by the escape character and the name (&gt;, &lt;, &quot;, &apos;).
In XML, & must be escaped as &amp;. The & character is reserved for entities and thus not allowed otherwise. Entities are used to escape characters with special meanings in XML.
Any other software reading the XML has to decode the entity again. &lt; for < and &gt; for > are other examples; other markup languages such as HTML provide even more of these.
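As a rough illustration of the points above, the sketch below writes the en dash as a plain character (U+2013) instead of the HTML entity, lets the writer escape the &, and then checks what a parser makes of both variants. The helper name WriteNode is made up for this example, and whether a literal en dash satisfies the asker's project requirement is an assumption.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: writes <node>text</node> and returns the markup.
static string WriteNode(string text)
{
    var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
    var sw = new StringWriter();
    using (var writer = XmlWriter.Create(sw, settings))
    {
        writer.WriteStartElement("node");
        writer.WriteString(text);  // '&' is escaped, the en dash is written as-is
        writer.WriteEndElement();
    }
    return sw.ToString();
}

string original = "good\u2013bad & more";
string xml = WriteNode(original);
Console.WriteLine(xml);

// A reader decodes &amp; back to '&', so the text survives the round trip.
string roundTripped = XElement.Parse(xml).Value;
Console.WriteLine(roundTripped == original);

// The HTML entity is not declared in plain XML, so a parser rejects it.
bool rejected;
try
{
    XElement.Parse("<node>good&ndash;bad</node>");
    rejected = false;
}
catch (XmlException)
{
    rejected = true;
}
Console.WriteLine(rejected);
```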
I think you will need to encode it. Like so:
colTest = "&"
writer.WriteEncodedText(colTest)

Cleanup xml file - Invalid character in the given encoding

I am integrating against Magento ecommerce using their "SOAP" api, and the API returns "XML" results. Problem is, this is not always well formed:
<product>
<entity_id>18</entity_id>
<price regular="2925  <span>Nok</span>"/>
...
In this specific case, the price regular attribute has both an invisible character 0xa0 (before the span tag), and < > within the attribute text.
I have no way to get proper well-formed XML from Magento it seems, so the alternative is to clean it up before I feed it to my XmlSerializer deserialization:
XmlSerializer serializer = new XmlSerializer(typeof(Responses.Product.product));
product = serializer.Deserialize(textReader) as Responses.Product.product;
The invisible character I can get rid of using a simple text replace, but I'm more unsure about the <> within the attribute text.
My question is: how do I clean it up to be valid XML?
The character 0x3c is the < character. For an invisible character you would rather be looking for something like the 0x09 TAB character.
To fix the broken markup you could look for that specific HTML tag in the content, using a regular expression to allow any currency within the tag:
xml = Regex.Replace(xml, "<span>([A-Za-z]{3})</span>", "&lt;span&gt;$1&lt;/span&gt;");
This works as long as there aren't any span elements with three-character content in the XML code itself. You could do similar replacements for other HTML tags, but try to keep the pattern as specific as possible, to avoid false positives.
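Put together, the cleanup might look something like the sketch below. The input string is a shortened stand-in for the Magento response, and it uses XElement.Parse rather than the asker's XmlSerializer just to check that the result is well formed; both are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Shortened stand-in for the response; \u00A0 is the invisible no-break space.
string xml = "<product><entity_id>18</entity_id>"
           + "<price regular=\"2925\u00A0<span>Nok</span>\"/></product>";

// Step 1: the simple text replace for the invisible character.
xml = xml.Replace("\u00A0", "");

// Step 2: escape the HTML tag inside the attribute value.
xml = Regex.Replace(xml, "<span>([A-Za-z]{3})</span>", "&lt;span&gt;$1&lt;/span&gt;");

// The cleaned string now parses, and the attribute value is unescaped on read.
XElement product = XElement.Parse(xml);
string regular = product.Descendants("price").Attributes("regular").First().Value;
Console.WriteLine(regular);
```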

How to keep &lt; in Value when using XmlReader

XmlReader.Read converts &lt; to <.
When reading this sample XML fragment, &lt;add &gt;, XmlReader.NodeType is XmlNodeType.Text but XmlReader.Value contains <add >.
How can I retain the original format, &lt;add &gt;?
You can't.
If the actual content of that element is the escaped text, you need to further escape it in the XML, like this:
&amp;lt;add&amp;gt;
This will be properly read as
&lt;add&gt;
I can only guess you want to subsequently use the text 'plain' in another XML or HTML context.
The right answer is:
use an XmlWriter/XElement.ToString down the line, or
properly HtmlEncode it
Sidebar: XML != text, so don't treat it as such. Don't cut/paste fragments. You'll run into a brick wall with unparsed character data, different character sets, different encodings, repeated escaping or unbalanced escaping etc.
The XmlReader is supposed to read the Xml and give you the content. No other way about it.
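Here is a small sketch of both suggestions, assuming an element named item (the question does not show the surrounding markup). It reads a singly and a doubly escaped fragment with XmlReader, then re-encodes the decoded text with WebUtility.HtmlEncode, which is one way to "HtmlEncode it".

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

// Hypothetical helper: returns the text content of the root element.
static string ReadText(string xml)
{
    using var reader = XmlReader.Create(new StringReader(xml));
    reader.MoveToContent();
    return reader.ReadElementContentAsString();
}

// Escaped once: the reader hands back the decoded text.
string decoded = ReadText("<item>&lt;add&gt;</item>");
Console.WriteLine(decoded);

// Escaped twice: the reader hands back text that still contains the entities.
string stillEscaped = ReadText("<item>&amp;lt;add&amp;gt;</item>");
Console.WriteLine(stillEscaped);

// Or re-encode the decoded value when it goes into another XML or HTML context.
string reEncoded = WebUtility.HtmlEncode(decoded);
Console.WriteLine(reEncoded);
```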

How do I stop XElement.Save from escaping characters?

I'm populating an XElement with information and writing it to an XML file using the XElement.Save(path) method. At some point, certain characters in the resulting file are being escaped - for example, > becomes &gt;.
This behaviour is unacceptable, since I need to store information in the XML that includes the > character as part of a password. How can I write the 'raw' content of my XElement object to XML without having these escaped?
Lack of this behavior is unacceptable.
A standalone unescaped > is invalid XML.
XElement is designed to produce valid XML.
If you want to get the unescaped content of the element, use the Value property.
The XML specification usually allows > to appear unescaped. XDocument plays it safe and escapes it although it appears in places where the escaping is not strictly required.
You can do a replace on the generated XML. Be aware per http://www.w3.org/TR/REC-xml#syntax, if this results in any ]]> sequences, the XML will not conform to the XML specification. Moreover, XDocument.Parse will actually reject such XML with the error "']]>' is not allowed in character data.".
XDocument doc = XDocument.Parse("<test>Test&gt;Data</test>");
// Don't use this if it could result in any ]]> sequences!
string s = doc.ToString().Replace("&gt;", ">");
System.IO.File.WriteAllText(@"c:\path\test.xml", s);
Considering that any spec-compliant XML parser must support &gt;, I'd highly recommend fixing the code that processes your program's XML output.
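For what it's worth, the escaping only exists in the serialized form. A minimal sketch, with a made-up element name and password value, showing that the Value property returns the original characters after a round trip:

```csharp
using System;
using System.Xml.Linq;

// Hypothetical password containing characters that get escaped on output.
string password = "a>b&c";
var element = new XElement("password", password);

// Serialized form: the reserved characters appear as entities.
string serialized = element.ToString();
Console.WriteLine(serialized);

// Parsed form: Value returns the unescaped text again.
string roundTripped = XElement.Parse(serialized).Value;
Console.WriteLine(roundTripped == password);
```

So a consumer that reads the file with an XML parser gets the password back unchanged, without any manual replace.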

XSLT: transfer xml with the closed tags

I'm using XSLT to transform an XML document into XML of a different format. If an element has no data, it is output as a self-closing tag, e.g. <data />, but I want to output it with a closing tag, like this: <data></data>.
If I change the output method from "xml" to "html" then I can get the <data></data>, but I will lose the <?xml version="1.0" encoding="UTF-8"?> on the top of the document. Is this the correct way of doing this?
Many thanks.
Daoming
If you want this because you think that self closing tags are ugly, then get over it.
If you want to pass the output to some non-conformant XML parser that is under your control, then use a better parser, or fix the one you are using.
If it is out of your control, and you must send it to an inadequate XML Parser, then do you really need the prolog? If not, then html output method is fine.
If you do need the XML prolog, then you could use the html output method, and prepend the prolog after transformation, but before sending it to the deficient parser.
Alternatively, you could output it as XML with self-closing tags, and preprocess before sending it to your deficient parser with some kind of custom serialisation, using the DOM. If it can't handle self-closing tags, then I'm sure that isn't the only way in which it fails to parse XML. You might need to do something about namespaces, for example.
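If the post-processing happens in .NET (an assumption; the question does not say which platform runs the transform), one possible DOM-based sketch is to load the transformed output into an XmlDocument and switch every empty element to the long tag format through the XmlElement.IsEmpty property:

```csharp
using System;
using System.Xml;

// Stand-in for the XSLT output; the element names are made up.
var doc = new XmlDocument();
doc.LoadXml("<root><data /><data>x</data></root>");

// IsEmpty = false asks the DOM to serialize <data></data> instead of <data />.
foreach (XmlElement element in doc.GetElementsByTagName("*"))
{
    if (element.IsEmpty)
        element.IsEmpty = false;
}

string result = doc.OuterXml;
Console.WriteLine(result);
```

This only changes the serialization; to any conformant parser the two forms are still the same document.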
You could try adding an empty text node to any empty elements that you are outputting. That might do the trick.
Self-closed and explicitly closed elements are exactly the same thing in any regard whatsoever.
Only if somewhere along your processing chain there is a tool that is not XML aware (code that does XML processing with regex, for example), it might make a difference. At which point you should think about changing that part of the processing, instead of the XML generation/serialization part.
