I need to parse XML in .NET, and I have the choice between at least XmlTextReader and XDocument. Are there any comparisons between those two (or any other XML parsers contained in the framework)?
Maybe this could help me to decide without trying both of them in depth.
The XML files are expected to be rather small, so speed and memory usage are minor issues compared to ease of use. :-)
(I'm going to use them from C# and/or IronPython.)
Thanks!
If you're happy reading everything into memory, use XDocument. It'll make your life much easier. LINQ to XML is a lovely API.
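To give a feel for how little code the in-memory approach needs, here is a minimal sketch. The element names (book, title) and the BookTitles class are made up for illustration; only XDocument and the LINQ to XML calls come from the framework.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BookTitles
{
    // Parses the whole document into memory and returns the text of the
    // <title> child of every <book> element, in document order.
    public static List<string> Read(string xml)
    {
        XDocument doc = XDocument.Parse(xml);   // XDocument.Load(path) would read a file instead
        return doc.Descendants("book")
                  .Select(b => (string)b.Element("title"))   // the cast yields null if <title> is missing
                  .ToList();
    }
}
```

Calling BookTitles.Read on a small document returns the titles as a plain list, which you can then filter or project further with ordinary LINQ operators.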
Use an XmlReader (such as XmlTextReader) if you need to handle huge XML files in a streaming fashion, basically. It's a much more painful API, but it allows streaming (i.e. only dealing with data as you need it, so you can go through a huge document and only have a small amount in memory at a time).
There's a hybrid approach, however - if you have a huge document made up of small elements, you can create an XElement from an XmlReader positioned at the start of the element, deal with the element using LINQ to XML, then move the XmlReader onto the next element and start again.
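A sketch of that hybrid approach, assuming the repeated elements share a known name. ElementStreamer is a hypothetical helper; XNode.ReadFrom is the framework method that builds an XElement from the reader's current position and leaves the reader on the node after it.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class ElementStreamer
{
    // Lazily yields each element with the given local name as an XElement,
    // so only one such element is materialised in memory at a time.
    public static IEnumerable<XElement> Stream(TextReader input, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == elementName)
                {
                    // ReadFrom consumes the whole element and advances the reader
                    // past it, so there is no extra Read() call on this branch.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

Each yielded XElement can be queried with LINQ to XML as usual and then dropped before the next one is read.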
XmlTextReader is kind of deprecated, do not use it.
From the MSDN blogs by the XML team:
Effective Xml Part 1: Choose the right API
Avoid using XmlTextReader. It contains quite a few bugs that could not be fixed without breaking existing applications already using it.
The world has moved on, have you? Xml APIs you should avoid using.
Obsolete APIs are easy since the compiler helps identifying them but there are two more APIs you should avoid using – namely XmlTextReader and XmlTextWriter. We found a number of bugs in these classes which we could not fix without breaking existing applications. The easy route would be to deprecate these classes and ask people to use replacement APIs instead. Unfortunately these two classes cannot be marked as obsolete because they are part of the ECMA-335 (Common Language Infrastructure) standard (http://www.ecma-international.org/publications/standards/Ecma-335.htm – see the companion CLILibrary.xml file, which is a part of Partition IV).
The good news is that even though these classes are not deprecated there are replacement APIs for these in .NET Framework already and moving to them is relatively easy. First it is necessary to find the places where XmlTextReader or XmlTextWriter is being used (unfortunately it is a manual step). Now all the occurrences of XmlTextReader should be replaced with XmlReader and all the occurrences of XmlTextWriter should be replaced with XmlWriter (note that XmlTextReader derives from XmlReader and XmlTextWriter derives from XmlWriter so the app can already be using these e.g. as formal parameters). The last step is to change the way the XmlReader/XmlWriter objects are instantiated – instead of creating the reader/writer directly it is necessary to use the static factory method .Create() present on both XmlReader and XmlWriter APIs.
Furthermore, IntelliSense in Visual Studio doesn't list XmlTextReader under the System.Xml namespace. The class is defined as:
[EditorBrowsable(EditorBrowsableState.Never)]
public class XmlTextReader : XmlReader, IXmlLineInfo, IXmlNamespaceResolver
The XmlReader.Create factory methods return other internal implementations of the abstract class XmlReader depending on the settings passed.
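For illustration, a minimal sketch of creating a reader through the factory rather than through new XmlTextReader(...). The ElementCounter class and the particular settings chosen are assumptions for the example, not something the quoted blog posts prescribe.

```csharp
using System.IO;
using System.Xml;

public static class ElementCounter
{
    // Streams through the input once and counts start tags with the given
    // local name, without building a document tree.
    public static int Count(TextReader input, string elementName)
    {
        var settings = new XmlReaderSettings
        {
            IgnoreWhitespace = true,   // skip insignificant whitespace nodes
            IgnoreComments = true      // skip comment nodes
        };

        int count = 0;
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == elementName)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

The calling code only ever sees the abstract XmlReader type, so it does not care which internal implementation the factory hands back.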
For forward-only streaming API (i.e. that doesn't load the entire thing into memory), use XmlReader via XmlReader.Create method.
For an easier API to work with, go for XDocument aka LINQ To XML. Find XDocument vs XmlDocument here and here.
Related
I was using XmlSerializer when I came across someone using XmlTextWriter.
What is the difference between those two?
To me, they serve the same function which is to create XML files. Microsoft website said that XmlTextWriter provides a fast, non-cached, forward-only way of generating streams but I don't really know what that means.
The XmlTextWriter class is an object that knows XML. You can use it to generate arbitrary XML documents. It doesn't matter where the data's coming from; you can pull data for XML elements, attributes, and contents along with the actual structure of the XML document from whatever source you see fit, and it doesn't need to match any particular object's structure or data.
On the other hand XmlSerializer is an object that knows types. It has the features necessary to analyze a type, extract the important information, and write that information out. It happens to be able to use an XmlTextWriter object to perform the actual I/O; you can provide your own, or at some level it will always create a similar object to handle the actual I/O. In other words, the serializer object doesn't really know XML per se, nor does it need to. It delegates that work to another object.
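The difference can be sketched side by side. PersonRecord and WriterVsSerializer are hypothetical names; the example uses XmlWriter.Create rather than XmlTextWriter directly, but the division of labour is the same.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical type used only for this illustration.
public class PersonRecord
{
    public string Name { get; set; }
    public int Age { get; set; }
}

public static class WriterVsSerializer
{
    // Writer: the caller decides every element and attribute explicitly.
    public static string WriteByHand(string name, int age)
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("person");
            writer.WriteAttributeString("age", XmlConvert.ToString(age));
            writer.WriteElementString("name", name);
            writer.WriteEndElement();
        }
        return output.ToString();
    }

    // Serializer: the XML shape is derived from the type's public members,
    // and the actual output is delegated to the writer passed in.
    public static string WriteFromType(PersonRecord person)
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            new XmlSerializer(typeof(PersonRecord)).Serialize(writer, person);
        }
        return output.ToString();
    }
}
```

In the first method the XML structure lives in your code; in the second it lives in the class definition (and any serialization attributes you put on it).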
Microsoft website said that XmlTextWriter provides a fast, non-cached, forward-only way of generating streams but I don't really know what that means.
"fast": not slow
"non-cached": important pieces of information are not stored in memory longer than absolutely necessary
"forward-only": you cannot revisit parts of the XML document you've already created
That is in contrast to other methods for generating XML documents in which the entire document structure is held in memory as it's constructed, and written to a file only once the entire document has been constructed. This is often described as the "document object model", or DOM.
The writer approach tends to be more efficient in performance because the XML data is being generated on the fly, as needed, directly from other in-memory data structures you already have. Because the DOM approach requires the entire file's data and structure to be represented in memory at once, it will usually use more memory, which in some cases can reduce performance (though, frankly, on modern computers and for typical XML documents, this is usually a complete non-issue).
I have a complex graph of XML-serializable classes that I'm able to (de)serialize to hard-disk just fine. But how do I handle massive changes to the graph schema structure? Is there some mechanism to handle XML schema upgrades? Some classes that would allow me to migrate old data to the new format?
Sure I could just use XmlReader/XmlWriter, go through every node and attribute and write several thousand lines of code to convert data to the new format, but maybe there is a better way?
I have found Object graph serialization in .NET and code version upgrades, but I don't think the linked articles apply when there are major changes in the model.
Instead of writing several thousand lines of code to convert files using XmlReader / XmlWriter, you could use XSLT. We are still talking hundreds of lines of code, and perhaps slower execution speeds, but if you are good at XSLT you could get it done much faster.
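A minimal sketch of driving such a conversion from C#, assuming the change is an element rename. The stylesheet, the element names oldName/newName, and the XsltMigration class are invented for the example; XslCompiledTransform is the framework's XSLT 1.0 processor.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltMigration
{
    // Hypothetical stylesheet: copies everything unchanged (identity template)
    // except <oldName>, which is renamed to <newName>.
    public const string RenameStylesheet = @"
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output omit-xml-declaration=""yes""/>
  <xsl:template match=""@*|node()"">
    <xsl:copy><xsl:apply-templates select=""@*|node()""/></xsl:copy>
  </xsl:template>
  <xsl:template match=""oldName"">
    <newName><xsl:apply-templates select=""@*|node()""/></newName>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the given stylesheet to the input XML and returns the result.
    public static string Transform(string stylesheet, string inputXml)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(stylesheet)))
        {
            xslt.Load(styleReader);
        }

        var output = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader(inputXml)))
        using (XmlWriter writer = XmlWriter.Create(output, xslt.OutputSettings))
        {
            xslt.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

The identity template keeps untouched parts of the file as they are, so the stylesheet only needs one extra template per structural change.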
The other approach would be to build a C# program that links both the old class and the new class (of course you'd need to rename the old class to avoid naming collision). The program would load OldMyClass from disk, construct NewMyClass from the values of its attributes, and serialize NewMyClass to disk. Essentially, this approach moves the task of conversion into the C# territory, which may be a lot more familiar to you.
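A sketch of that second approach, assuming a made-up schema change where a single FullName field was split into FirstName and LastName. OldSettings, NewSettings and SettingsMigrator are hypothetical; both classes map to the same root element name so they can read and write the same file.

```csharp
using System.IO;
using System.Xml.Serialization;

// Old shape of the data, renamed to avoid a collision with the new class.
[XmlRoot("Settings")]
public class OldSettings
{
    public string FullName { get; set; }
}

// New shape of the data.
[XmlRoot("Settings")]
public class NewSettings
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public static class SettingsMigrator
{
    // The conversion logic itself is ordinary C#.
    public static NewSettings Upgrade(OldSettings old)
    {
        string[] parts = (old.FullName ?? "").Split(new[] { ' ' }, 2);
        return new NewSettings
        {
            FirstName = parts[0],
            LastName = parts.Length > 1 ? parts[1] : ""
        };
    }

    // Deserializes with the old class, converts, and serializes with the new one.
    public static string UpgradeXml(string oldXml)
    {
        var oldSerializer = new XmlSerializer(typeof(OldSettings));
        var old = (OldSettings)oldSerializer.Deserialize(new StringReader(oldXml));

        var output = new StringWriter();
        new XmlSerializer(typeof(NewSettings)).Serialize(output, Upgrade(old));
        return output.ToString();
    }
}
```

For a large graph you would have one such pair per changed class, but each conversion stays a small, testable method.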
In this case, I keep my changes in my object and recreate my XML through the XmlSerializer: http://support.microsoft.com/kb/815813
With this I load and save the new XML schema based on my object.
I'm having to use .NET 2.0 so can't use any of the nice XDocument stuff.
I'm wondering if anyone has seen any helper/utility methods that still use XmlDocument but make xml creation a bit less tedious?
You could look at the XmlHandler class in Pluto.
It uses XmlDocument internally, but allows very simple reading and writing of values, including handling arrays, classes, etc, with reading and writing to specific locations via XPath queries.
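I can't reproduce Pluto's actual API here, but the general idea of such a helper is easy to sketch on top of XmlDocument. XmlDocHelper and its SetValue method are hypothetical, written to stay within what .NET 2.0 offers (no LINQ, no extension methods), and only handle simple slash-separated element paths.

```csharp
using System;
using System.Xml;

public static class XmlDocHelper
{
    // Sets the text of the element at a simple path such as
    // "config/server/port", creating any missing elements along the way.
    // Assumes a non-empty path whose first segment is (or becomes) the root.
    public static XmlElement SetValue(XmlDocument doc, string path, string value)
    {
        XmlNode current = doc;
        string[] names = path.Split(new char[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string name in names)
        {
            // Each segment is evaluated as a relative XPath child lookup.
            XmlNode child = current.SelectSingleNode(name);
            if (child == null)
            {
                child = current.AppendChild(doc.CreateElement(name));
            }
            current = child;
        }
        current.InnerText = value;
        return (XmlElement)current;
    }
}
```

Two calls such as SetValue(doc, "config/server/port", "8080") and SetValue(doc, "config/server/host", "localhost") then build the nested structure without any manual CreateElement/AppendChild bookkeeping at the call site.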
I write a desktop application that can open / edit / save documents.
Those documents are described by several objects of different types that store references to each other. Of course there is a Document class that serves as the root of this data structure.
The question is how to save this document model into a file.
What I need:
Support for recursive structures.
It must be able to open files even if they were produced from slightly different classes. My users don't want to recreate every document after every release just because I added a field somewhere.
It must deal with classes that are not known at compile time (for plug-in support).
What I tried so far:
XmlSerializer -> Fails the first and last criteria.
BinaryFormatter -> Fails the second criterion.
DataContractSerializer: Similar to XmlSerializer but with support for cyclic (recursive) references. Also it was designed with (forward/backward) compatibility in mind: Data Contract Versioning. [edit]
NetDataContractSerializer: While the DataContractSerializer still requires to know all types in advance (i.e. it can't work very well with inheritance), NetDataContractSerializer stores type information in the output. Other than that the two seem to be equivalent. [edit]
protobuf-net: Didn't have time to experiment with it yet, but it seems similar in function to DataContractSerializer, but using a binary format. [edit]
Handling of unknown types [edit]
There seem to be two philosophies about what to do when the static and dynamic type differ (if you have a field of type object but, let's say, a Person object in it). Basically the dynamic type must somehow get stored in the file.
Use different XML tags for different dynamic types. But since the XML tag to be used for a particular class might not be equal to the class name, it's only possible to go this route if the deserializer knows all possible types in advance (so that it can scan them for attributes).
Store the CLR type (class name, assembly name & version) during serialization. Use this info during deserialization to instantiate the right class. The types need not be known prior to deserialization.
The second one is simpler to use, but the resulting file will be CLR-dependent (and less tolerant of code modifications). That's probably why XmlSerializer and DataContractSerializer choose the first way. NetDataContractSerializer is not recommended because it uses the second approach (so does BinaryFormatter, by the way).
Any ideas?
The one you haven't tried is DataContractSerializer. There is a constructor that takes a parameter bool preserveObjectReferences that should handle the first criterion.
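A small sketch of a cyclic graph surviving a round trip. It sets the same option through DataContractSerializerSettings.PreserveObjectReferences (available from .NET 4.5) rather than the long constructor overload; GraphNode and CyclicRoundTrip are hypothetical names.

```csharp
using System.IO;
using System.Runtime.Serialization;

// Hypothetical node type that can point back at an earlier node.
[DataContract]
public class GraphNode
{
    [DataMember] public string Name { get; set; }
    [DataMember] public GraphNode Next { get; set; }
}

public static class CyclicRoundTrip
{
    // Serializes the graph to memory and reads it back, keeping object
    // identity so that cycles are restored instead of causing an error.
    public static GraphNode RoundTrip(GraphNode root)
    {
        var settings = new DataContractSerializerSettings { PreserveObjectReferences = true };
        var serializer = new DataContractSerializer(typeof(GraphNode), settings);
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, root);
            stream.Position = 0;
            return (GraphNode)serializer.ReadObject(stream);
        }
    }
}
```

With two nodes that reference each other, the copy returned by RoundTrip has the same shape: following Next twice from the root leads back to the root object itself.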
The WCF data contract serializer is probably closest to your needs, although not perfect.
There is only limited support for backwards compatibility (i.e. whether old versions of the program can read documents generated with a newer version). New fields are supported (via IExtensibleDataObject), but new classes or new enum values are not.
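To illustrate the new-fields case, here is a sketch in which an "old" class reads a document written by a "new" class, saves it again, and the field it does not know about survives. DocV1, DocV2 and ExtensionDataDemo are hypothetical; the added member is given a higher Order, following the usual data contract versioning advice.

```csharp
using System.IO;
using System.Runtime.Serialization;

// Old version of the contract: knows only Title, but keeps unknown data.
[DataContract(Name = "Doc", Namespace = "")]
public class DocV1 : IExtensibleDataObject
{
    [DataMember] public string Title { get; set; }

    // Unknown members found while reading are stored here and written back out.
    public ExtensionDataObject ExtensionData { get; set; }
}

// New version of the same contract with an added field.
[DataContract(Name = "Doc", Namespace = "")]
public class DocV2
{
    [DataMember] public string Title { get; set; }
    [DataMember(Order = 1)] public string Author { get; set; }
}

public static class ExtensionDataDemo
{
    // New format -> read and re-saved by the old class -> read by the new class again.
    public static DocV2 ThroughOldVersion(DocV2 original)
    {
        var v1 = new DataContractSerializer(typeof(DocV1));
        var v2 = new DataContractSerializer(typeof(DocV2));
        using (var fromNew = new MemoryStream())
        using (var fromOld = new MemoryStream())
        {
            v2.WriteObject(fromNew, original);
            fromNew.Position = 0;
            var old = (DocV1)v1.ReadObject(fromNew);

            v1.WriteObject(fromOld, old);
            fromOld.Position = 0;
            return (DocV2)v2.ReadObject(fromOld);
        }
    }
}
```

Without the IExtensibleDataObject implementation the old class would still read the file, but the Author value would be lost on the next save.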
I would think the XmlSerializer is your best bet. You won't be able to support everything on your requirements list without a bit of work in your Document classes - but the XmlSerializer architecture gives you extensibility points which should allow you to tap into its mechanism deep enough to do just about anything.
Using the IXmlSerializable interface - by implementing that on your classes you want to store - you should be able to do just about anything, really.
The interface exposes basically two methods, ReadXml and WriteXml (plus GetSchema, which can simply return null):
public void WriteXml(XmlWriter writer)
{
    // do what you need to do to write out your XML for this object
}

public void ReadXml(XmlReader reader)
{
    // do what you need to do to read your object from XML
}
Using these two methods, you should be able to capture the necessary state information from just about any object you might want to store, and turn it into XML that can be persisted to disk - and deserialized back into an object when the time comes!
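For a concrete feel, here is a minimal sketch of a class that stores its state as two attributes. Point2D and the attribute names are made up; the sketch assumes both attributes are always present when reading.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;

// Hypothetical type that controls its own XML representation.
public class Point2D : IXmlSerializable
{
    public double X { get; set; }
    public double Y { get; set; }

    // Returning null is the usual implementation.
    public XmlSchema GetSchema() { return null; }

    // XmlSerializer has already written the wrapper start tag;
    // this method only adds the content.
    public void WriteXml(XmlWriter writer)
    {
        writer.WriteAttributeString("x", XmlConvert.ToString(X));
        writer.WriteAttributeString("y", XmlConvert.ToString(Y));
    }

    // The reader is positioned on the wrapper start tag; the method must
    // consume the whole element before returning.
    public void ReadXml(XmlReader reader)
    {
        X = XmlConvert.ToDouble(reader.GetAttribute("x"));
        Y = XmlConvert.ToDouble(reader.GetAttribute("y"));
        reader.Skip();   // moves past the element, including any children
    }

    // Convenience helper for the example: serialize, then deserialize.
    public static Point2D RoundTrip(Point2D point)
    {
        var serializer = new XmlSerializer(typeof(Point2D));
        var text = new StringWriter();
        serializer.Serialize(text, point);
        return (Point2D)serializer.Deserialize(new StringReader(text.ToString()));
    }
}
```

The part that is easy to get wrong is ReadXml: if it does not consume exactly its own element, the serializer continues from the wrong position for whatever follows.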
XmlSerializer can work for your first criterion; however, you must provide the recursion yourself for objects like the TreeView control.
BinaryFormatter can work for all 3 criteria. If a class changes, you may have to create a conversion tool to convert old format documents to a new format. Or recognize an older format, deserialize to the old, and then save to the new - keeping your old class format around for a little while.
This will help cover version tolerance which is what I think you're after: MSDN - Version Tolerant Serialization
Does anyone know what advantages (memory/speed) there are in using a class generated by the XSD tool to explore a deserialized XML file, as opposed to XPath?
I'd say the advantage is that you get a strongly typed class, which is more convenient to use. Also, deserialization (XmlSerializer.Deserialize) will throw an exception if the XML in the file cannot be turned into the object, so you get minimal data validation for free.
If you don't want to write boilerplate code, and you need to check ANY values of your XML on the way through, you can't go wrong with the XSD.exe generated classes.
The two are very different; but XmlSerializer will always deserialize entire objects; with XPath you can pick and choose. I'd use XmlSerializer personally, though - harder to get wrong.
XPath, however, is a complex beast that depends on the back-end. For example, XmlDocument (mutable) will behave differently to XPathDocument (read-only, optimized for query).
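The same query against the two back-ends looks like this. XPathBackends is a hypothetical helper; both methods return the text of the first match, or null when nothing matches.

```csharp
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class XPathBackends
{
    // XmlDocument: a mutable DOM; the result is an XmlNode you could also edit.
    public static string FirstWithXmlDocument(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode node = doc.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }

    // XPathDocument: a read-only store built for querying; results are
    // exposed through XPathNavigator cursors instead of nodes.
    public static string FirstWithXPathDocument(string xml, string xpath)
    {
        var doc = new XPathDocument(new StringReader(xml));
        XPathNavigator match = doc.CreateNavigator().SelectSingleNode(xpath);
        return match == null ? null : match.Value;
    }
}
```

For a one-off lookup in a small file the choice hardly matters; the difference shows up when you run many queries over a large document that you never modify.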