I am refactoring some code in an existing system. The goal is to remove all instances of the XmlDocument to reduce the memory footprint. However, we use XPath to manipulate the xml when certain rules apply. Is there a way to use XPath without using a class that loads the entire document into memory? We've replaced all other instances with XmlTextReader, but those only worked because there is no XPath and the reading is very simple.
Some of the XPath uses values of other nodes to base its decision on. For instance, the value of the message node may be based on the value of the amount node, so there is a need to access multiple nodes at one time.
If your XPath expression is based on accessing multiple nodes, you're just going to have to read the XML into a DOM. Two things, though. First, you don't have to read all of it into a DOM, just the part you're querying. Second, which DOM you use makes a difference; XPathDocument is read-only and tuned for XPath query speed, unlike the more general-purpose but expensive XmlDocument.
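For example, if the rules only ever apply within one repeating element of a larger file, you can stream the file and build a small XPathDocument per fragment. This is only a sketch; the element and node names ("order", "amount", "message") are made up for illustration:
// Requires System.Xml and System.Xml.XPath.
// Sketch only: stream the big file, but hand each interesting fragment to XPath.
using (XmlReader reader = XmlReader.Create("big.xml"))
{
    while (reader.ReadToFollowing("order"))
    {
        // ReadSubtree returns a reader scoped to the current element,
        // so only this fragment gets loaded into memory.
        using (XmlReader subtree = reader.ReadSubtree())
        {
            XPathDocument fragment = new XPathDocument(subtree);
            XPathNavigator nav = fragment.CreateNavigator();

            // XPath can now see the whole fragment, e.g. base one node's value on another's.
            XPathNavigator amount = nav.SelectSingleNode("order/amount");
            XPathNavigator message = nav.SelectSingleNode("order/message");
            // ... apply your rules here ...
        }
    }
}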
I suppose that using System.Xml.Linq.XDocument is also prohibited? Otherwise, it would be a good choice, as it is faster than XmlDocument (as I remember).
Supporting XPath means supporting queries like:
//address[/states/state[@code=current()/@code]='California']
or
//item[@id != preceding-sibling::item/@id]
which require the XPath processor to be able to look everywhere in the document. You're not going to find a forward-only XPath processor.
The way to do this is to use XPathDocument, which can take a stream or a TextReader, so you can pass it a StringReader.
This avoids the overhead of XmlDocument: XPathDocument still reads the whole document, but into a much leaner, read-only structure that is optimized for XPath queries.
Here is an example which returns the value of the first node that satisfies the XPath query:
// Requires System.IO and System.Xml.XPath.
// SEARCH_EXPRESSION is whatever XPath query you need to run.
public string extract(string input_xml)
{
    XPathDocument document = new XPathDocument(new StringReader(input_xml));
    XPathNavigator navigator = document.CreateNavigator();
    XPathNodeIterator node_iterator = navigator.Select(SEARCH_EXPRESSION);

    // Return the value of the first matching node, or null if nothing matches.
    if (!node_iterator.MoveNext())
        return null;

    return node_iterator.Current.Value;
}
I have an application that has to load an XML document and output nodes depending on an XPath expression.
Suppose I start with a document like this:
<aaa>
...[many nodes here]...
<bbb>text</bbb>
...[many nodes here]...
<bbb>text</bbb>
...[many nodes here]...
</aaa>
With XPath //bbb
So far everything is nice.
And selection doc.SelectNodes("//bbb"); returns the list of required nodes.
Then someone uploads a document with one node like <myfancynamespace:foo/> and extra namespace in the root tag, and everything breaks.
Why? //bbb does not give a damn about myfancynamespace, theoretically it should even be good with //myfancynamespace:foo, as there is no ambiguity, but the expression returns 0 results and that's it.
Is there a workaround for this behavior?
I do have a namespace manager for the document, and I am passing it to the Xpath query. But the namespaces and the prefixes are unknown to me, so I can't add them before the query.
Do I have to pre-parse the document to fill the namespace manager before I do any selections? Why on earth such behavior, it just doesn't make sense.
EDIT:
I'm using:
XmlDocument and XmlNamespaceManager
EDIT2:
XmlDocument doc = new XmlDocument();
doc.XmlResolver = null;
XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable);
//I wish I could:
//nsmgr.AddNamespace("magic", "http://magicnamespaceuri/");
//...
doc.LoadXml(usersuppliedxml);
XmlNodeList nodes = doc.SelectNodes(usersuppliedxpath, nsmgr);//usersuppliedxpath -> "//bbb"
//nodes.Count should be > 0, but with namespaced document they are 0
EDIT3:
Found an article which describes this exact scenario and a workaround, though not a very pretty one: http://codeclimber.net.nz/archive/2008/01/09/How-to-query-a-XPath-doc-that-has-a-default.aspx
Almost seems that stripping the xmlns is the way to go...
You're missing the whole point of XML namespaces.
But if you really need to perform XPath on documents that will use an unknown namespace, and you really don't care about it, you will need to strip it out and reload the document. XPath will not work in a namespace-agnostic way, unless you want to use the local-name() function at every point in your selectors.
private XmlDocument StripNamespace(XmlDocument doc)
{
    if (doc.DocumentElement.NamespaceURI.Length > 0)
    {
        doc.DocumentElement.SetAttribute("xmlns", "");
        // must serialize and reload for this to take effect
        XmlDocument newDoc = new XmlDocument();
        newDoc.LoadXml(doc.OuterXml);
        return newDoc;
    }
    else
    {
        return doc;
    }
}
<myfancynamespace:foo/> is not necessarily the same as <foo/>.
Namespaces do matter. But I can understand your frustration, as they tend to break code: various implementations (C#, Java, ...) output them differently.
I suggest you change your XPath to accept any namespace. For example, instead of
//bbb
Define it as
//*[local-name()='bbb']
That should take care of it.
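For instance, against the code from EDIT2, a rough sketch (no namespace manager is needed for this form):
XmlDocument doc = new XmlDocument();
doc.LoadXml(usersuppliedxml);
// local-name() matches the element name regardless of its namespace,
// so this finds <bbb> whether or not the document declares a default namespace.
XmlNodeList nodes = doc.SelectNodes("//*[local-name()='bbb']");
The awkward part is that, since your XPath is user-supplied too, you would have to rewrite each incoming expression into this form.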
You should describe in a bit more detail what you want to do. The way you ask your question, it makes no sense at all. The namespace is just part of the name, nothing more, nothing less. So your question is the same as asking for an XPath query that gets all tags ending with "x". That's not the idea behind XML, but if you have a strange reason to do so, feel free to iterate over all nodes and implement it yourself. The same applies to the functionality you are requesting.
You could use the LINQ XML classes like XDocument. They greatly simplify working with namespaces.
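For example, a minimal namespace-agnostic sketch with XDocument, matching on the local name only:
using System.Linq;
using System.Xml.Linq;

XDocument doc = XDocument.Parse(usersuppliedxml);
// Ignore whatever namespace the uploader used and match the local name.
var bbbElements = doc.Descendants()
                     .Where(e => e.Name.LocalName == "bbb")
                     .ToList();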
I'm writing an XML document based on a stream of data. This part has been accomplished using the XmlTextWriter and the XElement classes.
Now when I come to read the document back in, I want to be able to 'delay-load' the XML document so that certain nodes are skipped (i.e. the ones which contain large binary chunks) and then load them when required.
Is this possible using the XmlDocument class? Or will I have to do things in a more manual way using the XmlTextReader class.
Thanks.
Nick.
Not possible with XmlDocument, as the whole document needs to be loaded into memory before it is parsed into a tree.
XmlTextReader/SAX is the standard solution.
This is not possible with either XmlDocument or XDocument.
Note that if you want to use XmlTextReader, it is forward-only, i.e. once you have skipped a node you can't come back to it.
See MSDN on this.
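A rough sketch of that forward-only approach; the file name and the element name "blob" for the large binary chunks are assumptions for illustration:
using (XmlTextReader reader = new XmlTextReader("data.xml"))
{
    reader.MoveToContent();
    while (!reader.EOF)
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "blob")
        {
            // Skip the large element without materializing its contents.
            // Skip() already advances the reader past the end tag.
            reader.Skip();
        }
        else
        {
            // ... handle the nodes you do care about here ...
            reader.Read();
        }
    }
}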
I am now learning XmlDocument, but I've just run into XDocument, and when I try to search for the differences or benefits between them I can't find anything useful. Could you please tell me why you would use one over the other?
If you're using .NET version 3.0 or lower, you have to use XmlDocument aka the classic DOM API. Likewise you'll find there are some other APIs which will expect this.
If you get the choice, however, I would thoroughly recommend using XDocument aka LINQ to XML. It's much simpler to create documents and process them. For example, it's the difference between:
XmlDocument doc = new XmlDocument();
XmlElement root = doc.CreateElement("root");
root.SetAttribute("name", "value");
XmlElement child = doc.CreateElement("child");
child.InnerText = "text node";
root.AppendChild(child);
doc.AppendChild(root);
and
XDocument doc = new XDocument(
    new XElement("root",
        new XAttribute("name", "value"),
        new XElement("child", "text node")));
Namespaces are pretty easy to work with in LINQ to XML, unlike any other XML API I've ever seen:
XNamespace ns = "http://somewhere.com";
XElement element = new XElement(ns + "elementName");
// etc
LINQ to XML also works really well with LINQ - its construction model allows you to build elements with sequences of sub-elements really easily:
// Customers is a List<Customer>
XElement customersElement = new XElement("customers",
    customers.Select(c => new XElement("customer",
        new XAttribute("name", c.Name),
        new XAttribute("lastSeen", c.LastOrder),
        new XElement("address",
            new XAttribute("town", c.Town),
            new XAttribute("firstline", c.Address1),
            // etc
        ))));
It's all a lot more declarative, which fits in with the general LINQ style.
Now as Brannon mentioned, these are in-memory APIs rather than streaming ones (although XStreamingElement supports lazy output). XmlReader and XmlWriter are the normal ways of streaming XML in .NET, but you can mix all the APIs to some extent. For example, you can stream a large document but use LINQ to XML by positioning an XmlReader at the start of an element, reading an XElement from it and processing it, then moving on to the next element etc. There are various blog posts about this technique, here's one I found with a quick search.
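A condensed sketch of that hybrid technique; the file name and the "customer" element are assumptions for illustration:
using System.Xml;
using System.Xml.Linq;

using (XmlReader reader = XmlReader.Create("huge.xml"))
{
    reader.MoveToContent();
    while (!reader.EOF)
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "customer")
        {
            // ReadFrom consumes the element and leaves the reader positioned
            // on the node that follows it, so don't call Read() here as well.
            XElement customer = (XElement)XNode.ReadFrom(reader);
            // ... process the element with LINQ to XML ...
        }
        else
        {
            reader.Read();
        }
    }
}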
I am surprised none of the answers so far mentions the fact that XmlDocument provides no line information, while XDocument does (through the IXmlLineInfo interface).
This can be a critical feature in some cases (for example if you want to report errors in an XML file, or keep track of where elements are defined in general), and you'd better be aware of it before you happily start implementing with XmlDocument, only to discover later that you have to change it all.
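A quick sketch of getting at that information; note that you have to opt in when loading the document:
using System;
using System.Xml;
using System.Xml.Linq;

XDocument doc = XDocument.Load("file.xml", LoadOptions.SetLineInfo);
foreach (XElement element in doc.Descendants())
{
    IXmlLineInfo lineInfo = element; // every XObject implements IXmlLineInfo
    if (lineInfo.HasLineInfo())
    {
        Console.WriteLine("{0} at line {1}, position {2}",
            element.Name, lineInfo.LineNumber, lineInfo.LinePosition);
    }
}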
XmlDocument is great for developers who are familiar with the XML DOM object model. It's been around for a while, and more or less corresponds to a W3C standard. It supports manual navigation as well as XPath node selection.
XDocument powers the LINQ to XML feature in .NET 3.5. It makes heavy use of IEnumerable<> and can be easier to work with in straight C#.
Both document models require you to load the entire document into memory (unlike XmlReader for example).
As mentioned elsewhere, LINQ to XML undoubtedly makes creating and altering XML documents a breeze compared to XmlDocument, and the XNamespace ns + "elementName" syntax makes for pleasurable reading when dealing with namespaces.
One thing worth mentioning for XSL and XPath die-hards is that it IS still possible to execute arbitrary XPath 1.0 expressions on LINQ to XML XNodes by including:
using System.Xml.XPath;
and then we can navigate and project data using xpath via these extension methods:
XPathSelectElement - Single Element
XPathSelectElements - Node Set
XPathEvaluate - Scalars and others
For instance, given the Xml document:
<xml>
    <foo>
        <baz id="1">10</baz>
        <bar id="2" special="1">baa baa</bar>
        <baz id="3">20</baz>
        <bar id="4" />
        <bar id="5" />
    </foo>
    <foo id="123">Text 1<moo />Text 2
    </foo>
</xml>
We can evaluate:
var node = xele.XPathSelectElement("/xml/foo[@id='123']");
var nodes = xele.XPathSelectElements(
    "//moo/ancestor::xml/descendant::baz[@id='1']/following-sibling::bar[not(@special='1')]");
var sum = xele.XPathEvaluate("sum(//foo[not(moo)]/baz)");
XDocument is from the LINQ to XML API, and XmlDocument is the standard DOM-style API for XML. If you know DOM well, and don't want to learn LINQ to XML, go with XmlDocument. If you're new to both, check out this page that compares the two, and pick which one you like the looks of better.
I've just started using LINQ to XML, and I love the way you create an XML document using functional construction. It's really nice. DOM is clunky in comparison.
Also, note that XDocument is supported in Xbox 360 and Windows Phone OS 7.0.
If you target them, develop for XDocument or migrate from XmlDocument.
I believe that XDocument makes a lot more object-creation calls. I suspect that when you're handling a lot of XML documents, XmlDocument will be faster.
One place this happens is in managing scan data. Many scan tools output their data in XML (for obvious reasons). If you have to process a lot of these scan files, I think you'll have better performance with XmlDocument.
I have tons of XML files, all containing the same XML document but with different values. The structure is the same for each file.
Inside this file I have a datetime field.
What is the best, most efficient way to query these XML files, so that I can retrieve, for example, all files where the datetime field equals today's date?
I'm using C# and .net v2. Should I be using XML objects to achieve this or text in file search routines?
Some code examples would be great... or just the general theory, anything would help, thanks...
This depends on the size of those files, and how complex the data actually is. As far as I understand the question, for this kind of XML data, using an XPath query and going through all the files might be the best approach, possibly caching the files in order to lessen the parsing overhead.
Have a look at:
XPathDocument, XmlDocument classes and XPath queries
http://support.microsoft.com/kb/317069
Something like this should do (not tested though):
XmlNamespaceManager nsmgr = new XmlNamespaceManager(new NameTable());
// if required, add your namespace prefixes here to nsmgr
XPathExpression expression = XPathExpression.Compile("//element[@date='20090101']", nsmgr); // your query as XPath
foreach (string fileName in Directory.GetFiles("PathToXmlFiles", "*.xml")) {
    XPathDocument doc;
    using (XmlTextReader reader = new XmlTextReader(fileName, nsmgr.NameTable)) {
        doc = new XPathDocument(reader);
    }
    if (doc.CreateNavigator().SelectSingleNode(expression) != null) {
        // matching document found
    }
}
Note: while you can also load an XPathDocument directly from a URI/path, using the reader makes sure that the same nametable is used as the one used to compile the XPath query. If a different nametable were used, you wouldn't get results from the query.
You might look into running XSL queries. See also XSLT Tutorial, XML transformation using Xslt in C#, How to query XML with an XPath expression by using Visual C#.
This question also relates to another on Stack Overflow: Parse multiple XML files with ASP.NET (C#) and return those with particular element. The accepted answer there, though, suggests using Linq.
If it is at all possible to move to C# 3.0 / .NET 3.5, LINQ-to-XML would be by far the easiest option.
With .NET 2.0, you're stuck with either XML objects or XSL.
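If the upgrade is an option, here is a rough LINQ to XML sketch; the "date" element name and its yyyy-MM-dd format are assumptions about your files, so adjust them to your schema:
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

string today = DateTime.Today.ToString("yyyy-MM-dd");

var matchingFiles =
    from file in Directory.GetFiles("PathToXmlFiles", "*.xml")
    let doc = XDocument.Load(file)
    // "date" and its format are assumptions about the schema.
    where doc.Descendants("date").Any(d => (string)d == today)
    select file;

foreach (string file in matchingFiles)
{
    Console.WriteLine(file);
}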
I'm working on a project for school that involves a heavy amount of XML parsing. I'm coding in C#, but I have yet to find a "suitable" method of parsing this XML out. There are several different ways I've looked at, but I haven't gotten it right yet; so I have come to you. Ideally, I'm looking for something kind of similar to Beautiful Soup in Python (sort of).
I was wondering if there was any way to convert XML like this:
<config>
    <bgimg>C:\\background.png</bgimg>
    <nodelist>
        <node>
            <oid>012345</oid>
            <image>C:\\image.png</image>
            <label>EHRV</label>
            <tooltip>
                <header>EHR Viewer</header>
                <body>Version 1.0</body>
                <icon>C:\\ico\ehrv.png</icon>
            </tooltip>
            <msgSource>8181:iqLog</msgSource>
        </node>
    </nodelist>
</config>
Into an Array/Hastable/Dictionary/Other like this:
Array
(
    ["config"] => array
    (
        ["bgimg"] => "C:\\background.png"
        ["nodelist"] => array
        (
            ["node"] => array
            (
                ["oid"] => "012345"
                ["image"] => "C:\\image.png"
                ["label"] => "EHRV"
                ["tooltip"] => array
                (
                    ["header"] => "EHR Viewer"
                    ["body"] => "Version 1.0"
                    ["icon"] => "C:\\ico\ehrv.png"
                )
                ["msgSource"] => "8181:iqLog"
            )
        )
    )
)
Even just giving me a decent resource to look through would be really helpful. Thanks a ton.
I would look into Linq to Xml. This gives you an object structure similar to the Xml file that is fairly easy to traverse.
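For example, a small sketch of walking the config document from the question with LINQ to XML (assuming it has been saved to config.xml):
using System;
using System.Xml.Linq;

XDocument doc = XDocument.Load("config.xml");

string bgimg = (string)doc.Root.Element("bgimg");

foreach (XElement node in doc.Root.Element("nodelist").Elements("node"))
{
    string oid = (string)node.Element("oid");
    string label = (string)node.Element("label");
    string header = (string)node.Element("tooltip").Element("header");
    Console.WriteLine("{0}: {1} ({2})", oid, label, header);
}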
XmlDocument + XPath is pretty much all you ever need in .NET to parse XML.
There must be a half-dozen different ways to do this in C#. My favorite uses the System.Xml namespace, particularly System.Xml.Serialization.
You use a command-line tool called xsd.exe to turn an XML sample into an XSD schema file (tip: make sure your nodelist has more than one node in the sample), and then use it again on the schema to turn that into a C# class file you can load into your project and easily use with the System.Xml.Serialization.XmlSerializer class.
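Roughly, the workflow looks like the sketch below. The generated type name (here "config") depends on your sample document, so treat it as an assumption:
// Step 1 (command line): infer a schema, then generate classes from it:
//   xsd.exe sample.xml            -> produces sample.xsd
//   xsd.exe sample.xsd /classes   -> produces sample.cs (add it to your project)
//
// Step 2 (code): deserialize the XML into the generated class.
// "config" is assumed to be the class xsd.exe generated for the root element.
using System.IO;
using System.Xml.Serialization;

XmlSerializer serializer = new XmlSerializer(typeof(config));
using (StreamReader reader = new StreamReader("sample.xml"))
{
    config cfg = (config)serializer.Deserialize(reader);
    // cfg now exposes the document's elements as ordinary properties.
}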
There's no shame in using an old-fashioned XmlDocument:
var xml = "<config>hello world</config>";
var doc = new System.Xml.XmlDocument();
doc.LoadXml(xml);
var nodes = doc.SelectNodes("/config");
You should definitely use LINQ to XML, a.k.a. XLINQ. There is a nice tool called LINQPad that you should check out. It has nice features, from a comprehensive examples library to allowing you to directly query an SQL database via LINQ to SQL. Best of all, it lets you test your queries before putting them into code.
The best approach will be dictated by what you actually want to do with the data once you've parsed it out.
If you want to pass it around in a structured-but-not-tied-to-XML fashion, XML Serialization is probably your best bet. This will also get you closest to what you've described, though you'll be dealing with an object graph rather than nested maps.
If you are just looking for a convenient format to query for specific bits of data, your best option would be LINQ to Xml. Alternatively, you could use the more traditional classes in the System.Xml namespace (starting with XmlDocument) and query using XPath.
You could also use any of these techniques (or an XmlTextReader) as building blocks to create the datastructure you've described but, barring some special need, I don't think it'll give you any more versatility than what the other approaches will.
You can also use serialization to convert the XML text back into a strongly typed class instance.
I personally like to map XML elements to classes and vice versa using the System.Xml.Serialization.XmlSerializer class.
http://msdn.microsoft.com/es-es/library/system.xml.serialization.xmlserializer(VS.80).aspx
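For instance, a minimal hand-written mapping for the tooltip element from the question above; the class and property names are my own choice, and tooltipXml stands in for wherever the fragment comes from:
using System.IO;
using System.Xml.Serialization;

public class Tooltip
{
    [XmlElement("header")] public string Header { get; set; }
    [XmlElement("body")]   public string Body   { get; set; }
    [XmlElement("icon")]   public string Icon   { get; set; }
}

// Deserialize a <tooltip>...</tooltip> fragment into the class above.
// tooltipXml is a hypothetical string holding that fragment.
XmlSerializer serializer = new XmlSerializer(typeof(Tooltip), new XmlRootAttribute("tooltip"));
using (StringReader reader = new StringReader(tooltipXml))
{
    Tooltip tip = (Tooltip)serializer.Deserialize(reader);
}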
I personally use XPathDocument, XPathNavigator and XPathNodeIterator e.g.
XPathDocument xDoc = new XPathDocument(CHOOSE SOURCE!);
XPathNavigator xNav = xDoc.CreateNavigator();
XPathNodeIterator iterator = xNav.Select("nodes/node[@SomePredicate = 'SomeValue']");
while (iterator.MoveNext())
{
    // SelectSingleNode returns an XPathNavigator, so take .Value to get the node's text
    string val = iterator.Current.SelectSingleNode("nodeWithValue").Value;
    // etc etc
}
Yeah, I agree.
The LINQ way is very nice.
And I especially like the way you write XML using it.
It is much simpler using the "objects in objects" way.