How to embed xml in xml - c#

I need to embed an entire well-formed XML document within another XML document. However, I would rather avoid CDATA (personal distaste), and I would also like to keep the parser that receives the whole document from wasting time parsing the embedded XML. The embedded XML could be quite large, and I would like the code that receives the whole file to treat the embedded XML as arbitrary data.
The idea that immediately came to mind is to encode the embedded xml in base64, or to zip it. Does this sound ok?
I'm coding in C# by the way.

You could convert the XML to a byte array, then convert it to Base64 format. That will allow you to nest it in an element, and not have to use CDATA.
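As a minimal sketch of the Base64 approach, assuming the inner document is available as a string and UTF-8 is an acceptable byte encoding. The class name Base64Embedding and the element names "envelope" and "payload" are hypothetical, chosen only for illustration:

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class Base64Embedding
{
    // Encodes the inner XML as Base64 so the outer parser only sees opaque text.
    // "envelope" and "payload" are illustrative element names, not a standard.
    public static XDocument Embed(string innerXml)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(innerXml);
        string encoded = Convert.ToBase64String(bytes);
        return new XDocument(
            new XElement("envelope",
                new XElement("payload", encoded)));
    }

    // Reverses Embed: reads the Base64 text and decodes it back to the XML string.
    public static string Extract(XDocument outer)
    {
        string encoded = outer.Root.Element("payload").Value;
        return Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
    }
}
```

The receiving side only has to parse one text node, and calls Extract when (and if) it actually needs the inner document. Note that this sketch holds the whole payload in memory, which matters for very large documents.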

The W3C-approved way of doing this is XInclude. There is an implementation for .Net at http://mvp-xml.sourceforge.net/xinclude/

Just a quick note: I have gone the base64 route and it works just fine, but it does come with a stiff performance penalty, especially under heavy usage. We do this with document fragments up to 20MB, and after base64 encoding they can take upwards of 65MB (with tags and data), even with zipping.
However, the bigger issue is that .NET base64 encoding can consume up to 10x the memory when performing the encoding/decoding, and can frequently cause OOM exceptions if done repeatedly and/or on multiple threads.
Someone on a similar question recommended ProtoBuf as an option, as well as Fast Infoset as another option.

Depending on how you construct the XML, one way is to not care about it and let the framework handle it.
XmlDocument doc = new XmlDocument();
doc.LoadXml("<?xml version=\"1.0\" encoding=\"utf-8\" ?><helloworld></helloworld>");
string xml = "<how><are><you reply=\"i am fine\">really</you></are></how>";
doc.GetElementsByTagName("helloworld")[0].InnerText = xml;
The output will contain the embedded markup as an escaped string (similar to HTML encoding), something like:
<?xml version="1.0" encoding="utf-8"?>
<helloworld>&lt;how&gt;&lt;are&gt;&lt;you reply="i am fine"&gt;really&lt;/you&gt;&lt;/are&gt;&lt;/how&gt;</helloworld>

I would encode it in your favorite way (e.g. base64 or HttpServerUtility::UrlEncode, ...) and then embed it.

If you don't need the XML declaration (the first line of the document), just insert the root element (with all its children) into the tree of the other XML document as a child of an existing element. Use a different namespace to separate the inserted elements.
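A rough sketch of inserting the root element as a child, using XmlDocument.ImportNode. The class name NodeEmbedding, the method name, and its parameters are hypothetical; it assumes the parent element exists in the outer document:

```csharp
using System.Xml;

public static class NodeEmbedding
{
    // Copies the root element of innerXml under the first element named parentName.
    public static XmlDocument EmbedAsChild(string outerXml, string innerXml, string parentName)
    {
        var outer = new XmlDocument();
        outer.LoadXml(outerXml);

        var inner = new XmlDocument();
        inner.LoadXml(innerXml); // the XML declaration, if any, is not part of DocumentElement

        // A node belongs to the document that created it, so it must be imported
        // before it can be appended elsewhere. deep = true copies all descendants.
        XmlNode imported = outer.ImportNode(inner.DocumentElement, true);
        XmlNode parent = outer.GetElementsByTagName(parentName)[0];
        parent.AppendChild(imported);
        return outer;
    }
}
```

Keep in mind that with this approach the receiving parser does parse the embedded elements, so it does not meet the "treat it as arbitrary data" requirement from the question.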

It seems that serialization is the recommended method.

Can't you use XSLT for this? Perhaps using xsl:copy or xsl:copy-of? This is what XSLT is for.

I use comments for this:
<!-- your xml text -->
[EDITED]
If the embedded XML itself contains comments, replace their delimiters with a different syntax, because XML comments cannot be nested.
<?xml version="1.0" encoding="iso-8859-1" ?>
<xml>
<status code="0" msg="" cause="" />
<data>
<order type="07" user="none" attrib="..." >
<xmlembeded >
<!--
<?xml version="1.0" encoding="iso-8859-1" ?>
<xml>
<status ret="000 "/>
<data>
<allxml_here />
<!** embedded comments **>
</data>
</xml>
-->
</xmlembeded >
</order>
<context sessionid="12345678" scriptname="/from/..." attrib="..." />
</data>
</xml>

Related

How to parse an xml that has non-xml data in it

I am working with some xml in C# and am having some issues parsing an xml file due to the format it is in. It has non xml data in the file and I have no control over the format of this file. The file is "test.xml"(see below). I am only concerned with the xml portion of the data, but am unsure the best way to go about accessing it. Any thoughts or recommendations would be greatly appreciated.
Test data -1
Smith, 2234
##*j
Random--
#<?xml version="1.0" encoding="utf-16"?>
<ConfigMessage xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.Test.com/schemas/Test.test.Config">
<Config>
<Version>10</Version>
<Build>00520</Build>
<EnableV>false</EnableV>
<BuildL>22</BuildL>
<BuildP>\\testpath\test</BuildP>
</Config>
</ConfigMessage>
#
Read the whole file into a string, then take the substring from the first '<' to the last '>' character found in it. You can treat that substring as normal XML from there. If there are random non-XML elements scattered throughout it, though, you will need additional logic to detect where each XML "block" starts and stops.
I can suggest the following solution: open your pseudo-XML as a plain text file, read the whole text, then use a regex to extract the XML document (the part of the original file that can be parsed as XML, i.e. [start tag | any symbols | end tag]). Load it into an XDocument (in memory) and parse it like a regular XML file.
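The first-'<'-to-last-'>' idea could be sketched like this. The class and method names are hypothetical, and it assumes the file contains exactly one XML block and no stray '<' or '>' characters in the surrounding non-XML text:

```csharp
using System;
using System.Xml.Linq;

public static class XmlExtractor
{
    // Returns the XML found between the first '<' and the last '>' of rawText.
    public static XDocument ExtractXml(string rawText)
    {
        int start = rawText.IndexOf('<');
        int end = rawText.LastIndexOf('>');
        if (start < 0 || end <= start)
            throw new FormatException("No XML block found in the input.");

        // Substring takes a length, so include the closing '>' itself.
        string xml = rawText.Substring(start, end - start + 1);
        return XDocument.Parse(xml);
    }
}
```

Usage would be along the lines of XmlExtractor.ExtractXml(File.ReadAllText("test.xml")). Because the elements in the sample file live in a default namespace, queries need an XNamespace, for example doc.Root.Element(ns + "Config").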

AngleSharp and XHTML round-trip

I'm trying to parse an XHTML file using AngleSharp, make a change, then output it. However, I'm having some issues getting the output to match the input.
If I use the XML parser and either the XmlMarkupFormatter or the HtmlMarkupFormatter, I get no self-closing tags (all are <img></img>) and no XML declaration.
If I use the HTML parser and the HtmlMarkupFormatter, I get self-closing tags that are invalid XML (all are simply <img>) and no XML declaration.
If I use the HTML parser and the XmlMarkupFormatter, I get nice self-closing tags (<img />) and the XML declaration. However, the XML declaration is picked up as a comment and output as <!-- <?xml version="1.0" encoding="UTF-8"?> -->
Is there a way around this or do I need to write my own MarkupFormatter?
Simple answer: It sounds like you need to provide your own MarkupFormatter.
There has been some effort to come up with an XhtmlMarkupFormatter, but this component has unfortunately not been realized so far. I imagine such a component may combine the serialization from both, the existing HTML and the available XML formatter.
Maybe this issue on the AngleSharp repo helps you.

spaces in end tags

I have a problem loading XML in C#.
XmlDocument doc = new XmlDocument();
string xmlText = File.ReadAllText("D:\\webservice_aspnet\\novo2.xml");
doc.PreserveWhitespace = true;
doc.LoadXml(xmlText);
The code above loads the file.
Original file:
<?xml version="1.0" encoding="UTF-8"?>
<teste>
<abc xmlns="xxx"/>
</teste>
When I take doc.InnerXml and write it to an XML file, it looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<teste>
<abc xmlns="xxx" />
</teste>
See that a space was added here:
<abc xmlns="xxx" />
at the end of the tag. I know this does not alter the structure of the file. However, a validation algorithm is run against that file, so I cannot change anything or add a space.
I do not want to fix this with a string replace, because the files are huge and could lose information.
Does anyone know how I can generate an identical file?
If you are reading XML using a parser (perhaps a home-brew parser) that can't handle all legal XML syntax, then you are storing up trouble, and your name will be cursed by anyone who inherits your code. Don't do it.
Don't try to fix the XML generation code to generate the subset of XML that your parser can handle. Fix your parser.
One way to fix your parser might be to add an XML canonicalization step as the first thing it does; canonicalization generates a well defined subset of XML that might (if you're lucky) correspond to the subset that your home-brew parser understands.
Perhaps you could try doc.Load(file) instead of doc.LoadXml(file).
Please try just using Replace function:
string YourXML=SomexmlContent;
string result=YourXML.Replace(" />","/>");
Hope this helps!

XSLT: transfer xml with the closed tags

I'm using XSLT to transform an XML document into a different XML format. If an element has no data, it is output as a self-closing tag, e.g. <data />, but I want it output with an explicit closing tag, like this: <data></data>.
If I change the output method from "xml" to "html" then I can get the <data></data>, but I will lose the <?xml version="1.0" encoding="UTF-8"?> on the top of the document. Is this the correct way of doing this?
Many thanks.
Daoming
If you want this because you think that self closing tags are ugly, then get over it.
If you want to pass the output to some non-conformant XML parser that is under your control, then use a better parser, or fix the one you are using.
If it is out of your control, and you must send it to an inadequate XML Parser, then do you really need the prolog? If not, then html output method is fine.
If you do need the XML prolog, then you could use the html output method, and prepend the prolog after transformation, but before sending it to the deficient parser.
Alternatively, you could output it as XML with self-closing tags, and preprocess before sending it to your deficient parser with some kind of custom serialisation, using the DOM. If it can't handle self-closing tags, then I'm sure that isn't the only way in which it fails to parse XML. You might need to do something about namespaces, for example.
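For the DOM post-processing route, here is a sketch in C# (the question does not say which XSLT processor or language is in use, so this is an assumption). It relies on the settable XmlElement.IsEmpty property; the class and method names are hypothetical:

```csharp
using System.Xml;

public static class EmptyElementExpander
{
    // Rewrites every self-closing element (<data />) in long form (<data></data>).
    public static string ExpandEmptyElements(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        foreach (XmlElement element in doc.SelectNodes("//*"))
        {
            // IsEmpty is true only for elements parsed in the short <data /> form.
            // Setting it to false makes the serializer write a full end tag.
            if (element.IsEmpty)
                element.IsEmpty = false;
        }
        return doc.OuterXml;
    }
}
```

This only changes the serialized form of empty elements; any other limitation of the receiving parser (namespaces, for example) would still need separate handling.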
You could try adding an empty text node to any empty elements that you are outputting. That might do the trick.
Self-closed and explicitly closed elements are exactly the same thing in any regard whatsoever.
Only if somewhere along your processing chain there is a tool that is not XML aware (code that does XML processing with regex, for example), it might make a difference. At which point you should think about changing that part of the processing, instead of the XML generation/serialization part.

Is this a valid XML file?

I need to send some XML elements to other services, and I want to ensure my XML file is well-formed so that other people can use their own XML parsers to parse it.
Is an XML file like the one below well-formed, or does it break any rules of XML? I am not sure whether &#x4 is a valid XML character sequence in .NET/C#.
I am confused about whether all strings that start with &#x are valid. If not all of them are valid, is there any way to filter the invalid ones out?
I am using VSTS 2008 + C# + .Net 3.5.
<?xml version="1.0" encoding="utf-8"?>
<Text>&#x4;</Text>
No. Character references must be terminated with semi-colons.
Update: Given that syntax error in the question has been corrected, see http://www.w3.org/TR/xml/#dt-charref for a description of what values are acceptable.
Frankly, I'd stick to UTF-8 for everything except ", <, > and &. It makes the XML itself more readable.
Use XML Validator. It shows the following error:
Error: Character reference must end with the ';' delimiter.
As others have suggested, there was a semi-colon missing, and use the validator. But also note that not all characters are legal, even if the input format is technically OK.
The following document is rejected by the validator:
<?xml version="1.0" encoding="utf-8"?>
<Text>&#x4;</Text>
This one does validate:
<?xml version="1.0" encoding="utf-8"?>
<Text>2</Text>
For information on characters to use or avoid, this seems interesting.
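On the "any ways to filter them out?" part of the question, one possible sketch uses XmlConvert.IsXmlChar. Note the assumptions: this method was added in .NET Framework 4.0, so it is not available on the .NET 3.5 setup mentioned in the question, and the class and method names below are hypothetical:

```csharp
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Drops characters that XML 1.0 does not allow (e.g. U+0004),
    // while keeping valid surrogate pairs intact.
    public static string RemoveInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            else if (i + 1 < text.Length && XmlConvert.IsXmlSurrogatePair(text[i + 1], c))
            {
                // IsXmlSurrogatePair takes the low surrogate first, then the high one.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
        }
        return sb.ToString();
    }
}
```

This would be applied to the raw text values before they are written into the document. It does not help with character references that are already present in serialized XML text.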
