XML Parsing - Removing nodes and keeping the overall structure


I have a massive XML file with a structure similar to this one:
<Item ItemId=";ResTVersion" ItemType="0" PsrId="245" Leaf="false">
<Disp Icon="Str" Expand="true" Disp="true" LocTbl="false" Order="13352" />
<Modified By="sachink" DateTime="2008-12-16T19:02:35Z" />
<PsrProps>
<Str Name="Kii" Val="yyyyyyyyyyyyy" />
</PsrProps>
<Item ItemId=";ResTFileVersion" ItemType="0;ResT" PsrId="245" InstFlg="true" Leaf="true">
<Str Cat="Text" UsrLk="true">
<Val><![CDATA[ttttttttt]]></Val>
<Tgt Cat="Text" Orig="New">
<Val><![CDATA[ttttttttt]]></Val>
</Tgt>
</Str>
<Disp Icon="Str" Order="13353" />
<Modified By="sachink" DateTime="2008-12-16T19:02:35Z" />
<Cmts>
<Cmt Name="Dev"><![CDATA[{Locked}]]></Cmt>
</Cmts>
</Item>
<Item ItemId=";ResTLanguageTag" ItemType="0;ResT" PsrId="245" InstFlg="true" Leaf="true">
<Str Cat="Text" UsrLk="true">
<Val><![CDATA[en-US]]></Val>
<Tgt Cat="Text" Orig="New">
<Val><![CDATA[en-US]]></Val>
</Tgt>
</Str>
<Disp Icon="Str" Order="13354" />
<Modified By="sachink" DateTime="2008-12-16T19:02:35Z" />
<Cmts>
<Cmt Name="Dev"><![CDATA[=.ABVUDHUIDSHFUIDSHFUISHDFUIDSH iusdhfUIHAs]]></Cmt>
</Cmts>
</Item>
</Item>
I have several item ids, and I want to create a new XML document that respects the old structure.
I use this code to retrieve the nodes that I want, and then create the new XML:
XmlNodeList nodes = originalXML.SelectNodes("//*[contains(@ItemId,'" + id + "')]");
So what I want is to remove some nodes, but I only have the ids of the ones I want to keep.
The problem: how do you keep the outer structure of an XML document when you use SelectNodes to get the inner nodes?
Thanks!

I would go the opposite way: remove what you don't need. It's hard to build an XmlDocument from scratch (takes a lot of coding).

I think you will be better off removing the nodes you don't want from the structure you already have.
XmlNodeList nodes = originalXML.SelectNodes("//*[not(contains(@test,'test'))]");
foreach (XmlNode node in nodes)
    // RemoveChild only removes direct children, so go through the node's own parent.
    node.ParentNode.RemoveChild(node);
should work.
If you need to preserve your original structure you can originalXML.Clone() it.
Sidenote: You might want to look into System.Xml.Linq.XDocument and System.Xml.Linq.XElement. I find those a lot easier to use.
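For what it's worth, here is a minimal sketch of the "remove what you don't need" idea using XDocument. The names ItemFilter, KeepItems and keepIds are made up for illustration, and it assumes an exact match on ItemId rather than the contains() test from the question. An Item is kept when it is wanted itself or when it has a wanted Item somewhere below it, which is what preserves the outer structure.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ItemFilter
{
    // Returns a pruned copy of the document: an <Item> survives if its own ItemId
    // is in keepIds or if any descendant <Item> is, so the enclosing Items stay.
    public static XDocument KeepItems(XDocument original, ISet<string> keepIds)
    {
        var copy = new XDocument(original);   // deep copy; the original is untouched

        bool IsWanted(XElement item) =>
            keepIds.Contains((string)item.Attribute("ItemId") ?? "");

        // ToList() materializes the query so nothing is removed while iterating.
        var unwanted = copy.Descendants("Item")
            .Where(item => !item.DescendantsAndSelf("Item").Any(IsWanted))
            .ToList();

        foreach (var item in unwanted)
            item.Remove();

        return copy;
    }
}
```

Usage would be something like ItemFilter.KeepItems(XDocument.Load(path), new HashSet<string> { ";ResTLanguageTag" }), where path is whatever file you are reading.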

Related

C# API for XML structure comparison

Is there a way to programmatically compare the structures of 2 XML files, but not their values?
More concretely, if you have 2 xml files:
<car>
<numberofwheels>4</numberofwheels>
<carcolor color="red" dateofpainting="2015-10-10" />
</car>
and
<car>
<numberofwheels>7</numberofwheels>
<carcolor color="blue" />
</car>
it would only notice that attribute dateofpainting is missing, but not the change of values (numberofwheels and color). I also don't care about blanks, newlines, attribute order, etc...
There is an XML Diff and Patch Tool from Microsoft, but as far as I can see, it also checks xml values and you cannot set it up to ignore them.

If either of the structures is predefined, you can use an XML schema to find the mismatch. If not, you have to traverse the documents node by node using the XmlDocument/XmlReader classes and build the list of differences yourself.
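A rough sketch of the node-by-node traversal, done here with LINQ to XML instead of XmlDocument/XmlReader (my substitution, not the answer's). XmlStructure, Paths and Missing are hypothetical names. It reduces each document to a set of element and attribute paths and ignores values, whitespace and attribute order. Note the assumptions: repeated siblings collapse into one path, and element order and namespaces are not compared.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlStructure
{
    // Collects one path per element ("/car/carcolor") and one per attribute
    // ("/car/carcolor/@color"); text and attribute values are never looked at.
    public static SortedSet<string> Paths(XDocument doc)
    {
        var paths = new SortedSet<string>(StringComparer.Ordinal);
        foreach (var element in doc.Descendants())
        {
            string path = "/" + string.Join("/",
                element.AncestorsAndSelf().Reverse().Select(e => e.Name.LocalName));
            paths.Add(path);
            foreach (var attribute in element.Attributes().Where(a => !a.IsNamespaceDeclaration))
                paths.Add(path + "/@" + attribute.Name.LocalName);
        }
        return paths;
    }

    // Paths present in 'expected' but absent from 'actual'.
    public static IEnumerable<string> Missing(XDocument expected, XDocument actual) =>
        Paths(expected).Except(Paths(actual));
}
```

Calling Missing in both directions gives you the attributes and elements that exist on only one side.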

Flatten XML structure by element with linq to xml

I recently created a post about flattening an XML structure so every element and its values were turned into attributes on the root element. I got some great answers and got it working. However, the sad thing is that by flattening, the client meant to flatten the elements and not make them into attributes :-/
What I have is this:
<members>
<member xmlns="mynamespace" id="1" status="1">
<sensitiveData>
<notes/>
<url>someurl</url>
<altUrl/>
<date1>somedate</date1>
<date2>someotherdate</date2>
<description>some description</description>
<tags/>
<category>some category</category>
</sensitiveData>
<contacts>
<contact contactId="1">
<contactPerson>some contact person</contactPerson>
<phone/>
<mobile>mobile number</mobile>
<email>some@email.com</email>
</contact>
</contacts>
</member>
</members>
And what I need is the following:
<members>
<member xmlns="mynamespace" id="1" status="1">
<sensitiveData/>
<notes/>
<url>someurl</url>
<altUrl/>
<date1>somedate</date1>
<date2>someotherdate</date2>
<description>some description</description>
<tags/>
<category>some category</category>
<contacts/>
<contact contactId="1"></contact>
<contactPerson>some contact person</contactPerson>
<phone/>
<mobile>mobile number</mobile>
<email>some@email.com</email>
</member>
</members>
So basically all elements, but flattened as child nodes of <member>. I do know that it's not pretty at all to begin parsing XML documents like this, but it's basically the only option left, as the CMS we're importing data to requires this flat structure and the XML document comes from an external web service.
I started to make a recursive method for this, but I've got an odd feeling that it could be made smoother (well, as smooth as possible at least) with some LINQ to XML (?) I'm not the best at linq to xml, so I hope there's someone out there who would be helpful to give a hint on how to solve this? :-)

This seems to work - there may be neater approaches, admittedly:
var doc = XDocument.Load("test.xml");
XNamespace ns = "mynamespace";
var member = doc.Root.Element(ns + "member");
// This will *sort* of flatten, but create copies...
var descendants = member.Descendants().ToList();
// So we need to strip child elements from everywhere...
// (but only elements, not text nodes). The ToList() call
// materializes the query, so we're not removing while we're iterating.
foreach (var nested in descendants.Elements().ToList())
{
nested.Remove();
}
member.ReplaceNodes(descendants);

Adding new elements to XmlDocument that abide by XSD

Currently, I'm adding elements to my XmlDocument using XPath notation, for which I've written code that places the element at the proper location in the file. With one exception: I don't know how to make it pay attention to the sequence rules defined in my XSD file.
Is there a way to add an element to an XmlDocument so that it abides by the sequence defined in the XSD that governs my XML file?
For example, my xml document should look like:
<rootTag>
<area name="I define an area">
<description>some text here</description>
<point x="1" y="1" />
<point x="2" y="2" />
<point x="3" y="3" />
</area>
</rootTag>
Yet I get, depending on the order in which the user enters values for the child tags above:
<rootTag>
<area name="I define an area">
<point x="1" y="1" />
<point x="2" y="2" />
<point x="3" y="3" />
<description>some text here</description>
</area>
</rootTag>
To correct the above, I create a DataSet (named tempXmlDataset) from the XSD file. I pass the contents of the XmlDocument into tempXmlDataset and things get re-ordered appropriately.
However, my problem is caused by an option for the first child of the XML document. This option is defined in the XSD to allow for "area", "line" or "point" objects. "area" and "line" both have "point" elements as children. But child "point" is not the same as "point" object. So, as you might already realize, tempXmlDataset.ReadXmlSchema(...) creates a "point" table which only has x and y in it. This is by definition of the children for "area" and "line".
So when my code runs tempXmlDataset.ReadXml(...) the attributes for "point" object do not get read in because it sees "point" object as child "point". Here's an example of "point" object:
<rootTag>
<point name="I define a point" x="3" y="3" otherAttributes="">
<description>some text here</description>
</point>
</rootTag>

Since you tagged this C#, I assume you're on the .NET platform. The System.Xml.Schema would be your best friend. For a program that uses the above API to generate XML, that also comes with source code you could use to understand how to solve your issue, I would use the XmlSampleGenerator.
Generating a sample XML requires exactly what you need in terms of constraining the XPath the user may enter at a given point in time. I believe you will have to constrain the XPath you allow based on where you are in the editing process, right from the beginning, otherwise, one single mistake could make the whole approach useless.
If you don't constrain from the beginning, it might be impossible to try to re-order based on an XSD (please read this also in SO)...
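As a small illustration of what System.Xml.Schema gives you, here is a sketch that validates an XmlDocument against an XSD after each edit, so a sequence violation is at least detected right away. SchemaCheck and Validate are hypothetical names, and the schema is passed in as text purely to keep the example self-contained. This only reports problems; it does not re-order anything.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class SchemaCheck
{
    // Returns the validation messages for 'doc' against the given XSD text.
    // An empty list means the document satisfies the schema, element order included.
    public static List<string> Validate(XmlDocument doc, string xsdText)
    {
        var errors = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xsdText)))
            doc.Schemas.Add(null, reader);   // null: use the schema's own targetNamespace

        // The handler is called once per violation instead of throwing on the first one.
        doc.Validate((sender, e) => errors.Add(e.Message));
        return errors;
    }
}
```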

Use xsd.exe to generate classes from the XSD. Don't try to create a DataSet for this case. You can then use the generated code together with the XmlSerializer to produce the needed XML files.
http://msdn.microsoft.com/en-us/library/system.xml.serialization.xmlserializer.aspx
Also see:
http://msdn.microsoft.com/en-us/library/ms950721.aspx
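To give an idea of the serializer route, here is a sketch with hand-written classes standing in for what xsd.exe would generate. RootTag, Area, Point and AreaWriter are assumptions for illustration; the real generated code will look different. The point is that the element sequence comes from the class definition (via Order), not from the order in which the user fills in values.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hand-written stand-ins for xsd.exe output (all names are assumptions).
[XmlRoot("rootTag")]
public class RootTag
{
    [XmlElement("area")]
    public Area Area { get; set; }
}

public class Area
{
    [XmlAttribute("name")]
    public string Name { get; set; }

    // Order fixes the element sequence regardless of assignment order.
    [XmlElement("description", Order = 1)]
    public string Description { get; set; }

    [XmlElement("point", Order = 2)]
    public Point[] Points { get; set; }
}

public class Point
{
    [XmlAttribute("x")] public int X { get; set; }
    [XmlAttribute("y")] public int Y { get; set; }
}

public static class AreaWriter
{
    public static string Serialize(RootTag root)
    {
        var serializer = new XmlSerializer(typeof(RootTag));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, root);
            return writer.ToString();
        }
    }
}
```

Even if Points is assigned before Description, the description element should be written first.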

xpath return string instead of nodelist

I am working on a biztalk project and I need to copy (filtered) content from 1 xml to another.
I have to do this with xpath, I can't use xsl transformation.
So my xpath to get the content from the source xml file is this:
//*[not(ancestor-or-self::IN1_Insurance)]|//IN1_Insurance[2]/descendant-or-self::*
Now this returns an xmlNodelist. Is it possible to return a string with all the nodes in it like:
"<root><node>text</node></root>"
If I put string() before my xpath it returns the values, but I want the whole xml in a string (with nodes..), so I could load that string in another xmldocument. I think this is the best method for my problem.
I know I can loop over the xmlnodelist and append the nodes to the new xmldocument, but it's a bit tricky to loop in a biztalk orchestration and I want to avoid this.
The code I can use is C#.
I've tried to just assign the nodelist to the xmldocument, but this throws a cast error (obvious..).
The way I see it is that I have 2 solutions:
assign the nodelist to the xmldocument without a loop (not possible i think in C#)
somehow convert the nodelist to string and load this in the xmldocument
load the xpath directly in the new xmldocument (don't know if this is possible since it returns a nodelist)
Thanks for your help
edit:
sample input:
<root>
<Patient>
<PatientId></PatientId>
<name></name>
</Patient>
<insurance>
<id>1</id>
<billing></billing>
</insurance>
<insurance>
<id>2</id>
<billing></billing>
</insurance>
<insurance>
<id>3</id>
<billing></billing>
</insurance>
</root>
Now I want to copy this sample to another XmlDocument, but without insurance nodes 2 and 3 (this is dynamic, so it could be insurance nodes 1 and 2 to delete, or 1 and 3...)
So this has to be the output:
<root>
<Patient>
<PatientId></PatientId>
<name></name>
</Patient>
<insurance>
<id>1</id>
<billing></billing>
</insurance>
</root>
What I am doing now is use the xpath to get the nodes I want. Then I want to assign the result to the new xmldocument, but this is not possible since I get the castException
string strXpath = "//*[not(ancestor-or-self::IN1_Insurance)]|//IN1_Insurance[2]/descendant-or-self::*";
xmlDoc = new System.Xml.XmlDocument();
xmlDoc = xpath(sourceXml, strXpath); // <= cast error (cannot cast XmlNodeList to XmlDocument)
I know the syntax is a bit strange, but it is biztalk c# code..

The most straightforward solution would indeed be to "loop over the xmlnodelist and append (import) the nodes to the new xmldocument", but since you can't loop, what other basic things can/can't you do?
To serialize the nodelist, you could try using XmlNodeList.ToString(). Even if that worked, you'd get a strange beast, because it could be duplicating parts of the XML document several times over, especially since you're explicitly including ancestors and descendants directly in the nodelist. It would not be something that you could parse back in and have a result that resembled the nodelist you started with.
In other words, it would be best to loop over the XmlNodeList and import the nodes to the new XmlDocument.
But even so, I would be really surprised if you wanted to put all these ancestor and descendant nodes:
//*[not(ancestor-or-self::IN1_Insurance)]|//IN1_Insurance[2]/descendant-or-self::*
directly into the new XML document. If you post some sample input and the desired output, we can probably help determine if that's the case.
Update:
I see what you're trying to do: copy an XML document, omitting all <insurance> elements (and their descendants) except the one you want.
This can be done without a loop if the output is as simple as your sample output: only one <Patient> and one <insurance> element, with their descendants, under one top-level element.
Something like (I can't test this as I don't have a biztalk server):
string xpathPatient = "/*/Patient";
string xpathInsuran = "/*/insurance[id = " + insId + "]"; // insId is a parameter
xmlDoc = new System.Xml.XmlDocument();
xmlPatient = xpath(sourceXml, xpathPatient);
xmlInsuran = xpath(sourceXml, xpathInsuran);
XmlElement rootNode = xmlDoc.CreateElement("root");
xmlDoc.AppendChild(rootNode);
//**Update: use [0] to get an XmlNode from the returned XmlNodeList (presumably)
rootNode.AppendChild(xmlDoc.ImportNode(xmlPatient[0], true));
rootNode.AppendChild(xmlDoc.ImportNode(xmlInsuran[0], true));
I confess though, I'm curious why you can't use XSLT. You're approaching tasks that would be more easily done in XSLT than in XPath + C# XmlDocument.
Update: since the xpath() function probably returns an XmlNodeList rather than an XmlNode, I added [0] to the first argument to ImportNode() above. Thanks to @Martin Honnen for alerting me to that.

XPath is a query language (only) for XML documents.
It operates on an abstract model -- the XML INFOSET, and cannot either modify the structure of the XML document(s) it operates on or serialize the INFOSET information items back to XML.
Therefore, the only way to achieve such serialization is to use the language that is hosting XPath.
Apart from this, there are obvious problems with your question. For example, there is no element named IN1_Insurance in the provided XML document, so the XPath expression provided:
//*[not(ancestor-or-self::IN1_Insurance)]|//IN1_Insurance[2]/descendant-or-self::*
selects all elements in the document.
Note:
The described task is elementary to fulfil using XSLT.
Finally: If you are allowed to use C# then you can use the XslCompiledTransform (or XslTransform) class. Use its Transform() method to carry out the following transformation against the XML document:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="insurance[not(id=1)]"/>
</xsl:stylesheet>
This produces exactly the wanted result:
<root>
<Patient>
<PatientId></PatientId>
<name></name>
</Patient>
<insurance>
<id>1</id>
<billing></billing>
</insurance>
</root>
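For completeness, a sketch of how that transformation could be run from C# with XslCompiledTransform, with no loop over a node list. InsuranceFilter and KeepInsurance are hypothetical names, and the id to keep is spliced into the stylesheet text because XSLT 1.0 does not allow variables in match patterns. I have not tried this inside a BizTalk orchestration.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class InsuranceFilter
{
    // Identity transform plus one empty template that drops unwanted <insurance> elements.
    // keepId is spliced into the stylesheet text, so it is an int, not free-form input.
    public static XmlDocument KeepInsurance(XmlDocument source, int keepId)
    {
        string stylesheet =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" +
            "<xsl:template match='node()|@*'>" +
            "<xsl:copy><xsl:apply-templates select='node()|@*'/></xsl:copy>" +
            "</xsl:template>" +
            "<xsl:template match='insurance[not(id=" + keepId + ")]'/>" +
            "</xsl:stylesheet>";

        var transform = new XslCompiledTransform();
        using (var reader = XmlReader.Create(new StringReader(stylesheet)))
            transform.Load(reader);

        // Write the result straight into a new XmlDocument instead of a string.
        var result = new XmlDocument();
        using (var writer = result.CreateNavigator().AppendChild())
            transform.Transform(source, null, writer);
        return result;
    }
}
```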

How to embed xml in xml

I need to embed an entire well-formed xml document within another xml document. However, I would rather avoid CDATA (personal distaste) and also I would like to avoid the parser that will receive the whole document from wasting time parsing the embedded xml. The embedded xml could be quite significant, and I would like the code that will receive the whole file to treat the embedded xml as arbitrary data.
The idea that immediately came to mind is to encode the embedded xml in base64, or to zip it. Does this sound ok?
I'm coding in C# by the way.

You could convert the XML to a byte array, then convert it to Base64 format. That will allow you to nest it in an element, and not have to use CDATA.
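A minimal sketch of that idea, assuming UTF-8 for the byte conversion. XmlEmbedding, Wrap and Unwrap, and the default element name "payload", are made up for the example.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class XmlEmbedding
{
    // Encodes the inner XML text as Base64 and stores it as the text of one element.
    public static XElement Wrap(string innerXml, string elementName = "payload")
    {
        string encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(innerXml));
        return new XElement(elementName, encoded);
    }

    // Reverses Wrap: decodes the element text back into the original XML string.
    public static string Unwrap(XElement wrapper)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(wrapper.Value));
    }
}
```

The receiving parser only ever sees one text node, so it never has to parse the embedded document.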

The W3C-approved way of doing this is XInclude. There is an implementation for .Net at http://mvp-xml.sourceforge.net/xinclude/

Just a quick note: I have gone the base64 route and it works just fine, but it does come with a stiff performance penalty, especially under heavy usage. We do this with document fragments up to 20MB, and after base64 encoding they can take upwards of 65MB (with tags and data), even with zipping.
However, the bigger issue is that .NET base64 encoding can consume up to 10x the memory when performing the encoding/decoding and can frequently cause OOM exceptions if done repeatedly and/or on multiple threads.
Someone, on a similar question recommended ProtoBuf as an option, as well as Fast InfoSet as another option.

Depending on how you construct the XML, one way is to not care about it and let the framework handle it.
XmlDocument doc = new XmlDocument();
doc.LoadXml("<?xml version=\"1.0\" encoding=\"utf-8\" ?><helloworld></helloworld>");
string xml = "<how><are><you reply=\"i am fine\">really</you></are></how>";
doc.GetElementsByTagName("helloworld")[0].InnerText = xml;
The output will be something like an HTML-encoded string:
<?xml version="1.0" encoding="utf-8"?>
<helloworld>&lt;how&gt;&lt;are&gt;&lt;you
reply="i am fine"&gt;really&lt;/you&gt;&lt;/are&gt;&lt;/how&gt;
</helloworld>

I would encode it in your favorite way (e.g. base64 or HttpServerUtility::UrlEncode, ...) and then embed it.

If you don't need the XML declaration (first line of the document), just insert the root element (with all its children) into the tree of the other XML document as a child of an existing element. Use a different namespace to separate the inserted elements.
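A sketch of that with XmlDocument, assuming the embedded document already declares its own namespace on its root element. XmlNesting, Embed and hostXPath are hypothetical names. Note that this is real nested XML, so the receiving parser will parse it like everything else.

```csharp
using System;
using System.Xml;

public static class XmlNesting
{
    // Copies the root element of 'inner' (the XML declaration is left behind)
    // under the first node of 'outer' that matches hostXPath.
    public static void Embed(XmlDocument outer, string hostXPath, XmlDocument inner)
    {
        XmlNode host = outer.SelectSingleNode(hostXPath)
            ?? throw new ArgumentException("No node matches " + hostXPath);

        // ImportNode creates a deep copy owned by 'outer'; appending a node that
        // belongs to another document directly would throw.
        XmlNode imported = outer.ImportNode(inner.DocumentElement, true);
        host.AppendChild(imported);
    }
}
```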

It seems that serialization is the recommended method.

Can't you use XSLT for this? Perhaps using xsl:copy or xsl:copy-of? This is what XSLT is for.

I use Comments for this :
<!-- your xml text -->
[EDITED]
If the embedded XML itself contains comments, replace their delimiters with a different syntax.
<?xml version="1.0" encoding="iso-8859-1" ?>
<xml>
<status code="0" msg="" cause="" />
<data>
<order type="07" user="none" attrib="..." >
<xmlembeded >
<!--
<?xml version="1.0" encoding="iso-8859-1" ?>
<xml>
<status ret="000 "/>
<data>
<allxml_here />
<!** embedded comments **>
</data>
</xml>
-->
</xmlembeded >
</order>
<context sessionid="12345678" scriptname="/from/..." attrib="..." />
</data>
</xml>
