Serialization for document storage

I am writing a desktop application that can open / edit / save documents.
Those documents are described by several objects of different types that store references to each other. Of course there is a Document class that serves as the root of this data structure.
The question is how to save this document model into a file.
What I need:
Support for recursive structures.
It must be able to open files even if they were produced from slightly different classes. My users don't want to recreate every document after every release just because I added a field somewhere.
It must deal with classes that are not known at compile time (for plug-in support).
What I tried so far:
XmlSerializer -> Fails the first and last criteria.
BinaryFormatter -> Fails the second criterion.
DataContractSerializer: Similar to XmlSerializer, but with support for cyclic (recursive) references. It was also designed with forward/backward compatibility in mind; see Data Contract Versioning.
NetDataContractSerializer: While DataContractSerializer still needs to know all types in advance (i.e. it does not work very well with inheritance), NetDataContractSerializer stores type information in the output. Other than that, the two seem to be equivalent.
protobuf-net: I haven't had time to experiment with it yet, but it seems similar in function to DataContractSerializer while using a binary format.
Handling of unknown types
There seem to be two philosophies about what to do when the static and dynamic types differ (say you have a field of type object that holds a Person object). Basically, the dynamic type must somehow get stored in the file.
Use different XML tags for different dynamic types. But since the XML tag used for a particular class might not be equal to the class name, it's only possible to go this route if the deserializer knows all possible types in advance (so that it can scan them for attributes).
Store the CLR type (class name, assembly name & version) during serialization. Use this info during deserialization to instantiate the right class. The types need not be known prior to deserialization.
The second one is simpler to use, but the resulting file will be CLR dependent (and more sensitive to code modifications). That's probably why XmlSerializer and DataContractSerializer chose the first way. NetDataContractSerializer is not recommended because it uses the second approach (so does BinaryFormatter, by the way).
Any ideas?

The one you haven't tried is DataContractSerializer. There is a constructor that takes a parameter bool preserveObjectReferences that should handle the first criterion.
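As a rough illustration of that suggestion, here is a minimal sketch of a cyclic parent/child graph surviving a round trip. It assumes .NET 4.5 or later, where the same switch is also exposed as DataContractSerializerSettings.PreserveObjectReferences; the Node class and the CyclicGraphDemo helper are hypothetical names made up for this example.

```csharp
using System.IO;
using System.Runtime.Serialization;

// Hypothetical document node; Parent and Child form a cycle.
[DataContract]
public class Node
{
    [DataMember] public string Name { get; set; }
    [DataMember] public Node Parent { get; set; }
    [DataMember] public Node Child { get; set; }
}

public static class CyclicGraphDemo
{
    // Writes the graph to memory and reads it back.
    public static Node RoundTrip(Node root)
    {
        var settings = new DataContractSerializerSettings
        {
            // Emit object ids and references instead of following the cycle.
            PreserveObjectReferences = true
        };
        var serializer = new DataContractSerializer(typeof(Node), settings);

        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, root);
            stream.Position = 0;
            return (Node)serializer.ReadObject(stream);
        }
    }
}
```

After the round trip, the child's Parent should be the very same instance as the deserialized root, not a copy. Note that the resulting XML carries serializer-specific id/ref attributes, so it is less friendly to other XML tools.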

The WCF data contract serializer is probably closest to your needs, although not perfect.
There is only limited support for backwards compatibility (i.e. whether old versions of the program can read documents generated with a newer version). New fields are supported (via IExtensibleDataObject), but new classes or new enum values are not.
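To make the IExtensibleDataObject point concrete, here is a small sketch in which an "old" class reads XML written by a "new" class, edits it, and writes it back without losing the member it does not know about. DocumentV1, DocumentV2, the urn:example:doc namespace and the helper names are all hypothetical; in a real application the two classes would live in different releases of the program rather than side by side.

```csharp
using System.IO;
using System.Runtime.Serialization;
using System.Text;

// "Old" release: knows only Title, but parks unknown members in ExtensionData.
[DataContract(Name = "Document", Namespace = "urn:example:doc")]
public class DocumentV1 : IExtensibleDataObject
{
    [DataMember] public string Title { get; set; }
    public ExtensionDataObject ExtensionData { get; set; }
}

// "New" release: same contract name, one additional member placed last via Order.
[DataContract(Name = "Document", Namespace = "urn:example:doc")]
public class DocumentV2
{
    [DataMember] public string Title { get; set; }
    [DataMember(Order = 2)] public string Author { get; set; }
}

public static class ExtensionDataDemo
{
    static string Write<T>(T value)
    {
        var serializer = new DataContractSerializer(typeof(T));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, value);
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }

    static T Read<T>(string xml)
    {
        var serializer = new DataContractSerializer(typeof(T));
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(xml)))
        {
            return (T)serializer.ReadObject(stream);
        }
    }

    // New file -> opened and edited by the old program -> reopened by the new program.
    public static DocumentV2 NewToOldToNew(DocumentV2 original)
    {
        DocumentV1 old = Read<DocumentV1>(Write(original));
        old.Title += " (edited)";
        return Read<DocumentV2>(Write(old));
    }
}
```

The Author value is never visible to DocumentV1, yet it should come back intact in the final DocumentV2 because the serializer stores it in ExtensionData and replays it on write.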

I would think the XmlSerializer is your best bet. You won't be able to support everything on your requirements list without a bit of work in your Document classes - but the XmlSerializer architecture gives you extensibility points which should allow you to tap into its mechanism deep enough to do just about anything.
By implementing the IXmlSerializable interface on the classes you want to store, you should be able to do just about anything, really.
The interface exposes basically two methods, ReadXml and WriteXml (there is also a third, GetSchema, which should simply return null):
public void WriteXml (XmlWriter writer)
{
// do what you need to do to write out your XML for this object
}
public void ReadXml (XmlReader reader)
{
// do what you need to do to read your object from XML
}
Using these two methods, you should be able to capture the necessary state information from just about any object you might want to store, and turn it into XML that can be persisted to disk - and deserialized back into an object when the time comes!
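A fuller sketch of those two methods might look like the following. The Shape class, its members and the ShapeXml helper are hypothetical; the point is only that ReadXml can be written to skip elements it does not recognize, which gives some tolerance toward files written by a newer release.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;

// Hypothetical class that takes full control of its XML form.
public class Shape : IXmlSerializable
{
    public string Name { get; set; } = "";
    public int Sides { get; set; }

    public XmlSchema GetSchema() => null;

    public void WriteXml(XmlWriter writer)
    {
        // The serializer has already written the wrapper element for this object.
        writer.WriteElementString("Name", Name);
        writer.WriteElementString("Sides", XmlConvert.ToString(Sides));
    }

    public void ReadXml(XmlReader reader)
    {
        // The reader is positioned on the wrapper element; this method must consume it.
        bool isEmpty = reader.IsEmptyElement;
        reader.ReadStartElement();
        if (isEmpty) return;

        while (reader.MoveToContent() == XmlNodeType.Element)
        {
            switch (reader.LocalName)
            {
                case "Name": Name = reader.ReadElementContentAsString(); break;
                case "Sides": Sides = reader.ReadElementContentAsInt(); break;
                default: reader.Skip(); break; // ignore elements from other versions
            }
        }
        reader.ReadEndElement();
    }
}

public static class ShapeXml
{
    public static string Save(Shape shape)
    {
        var serializer = new XmlSerializer(typeof(Shape));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, shape);
            return writer.ToString();
        }
    }

    public static Shape Load(string xml)
    {
        var serializer = new XmlSerializer(typeof(Shape));
        using (var reader = new StringReader(xml))
        {
            return (Shape)serializer.Deserialize(reader);
        }
    }
}
```

The price of this flexibility is that every class has to maintain its own reading and writing code by hand.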

XmlSerializer can work for your first criterion; however, you must provide the recursion yourself for objects like the TreeView control.
BinaryFormatter can work for all 3 criteria. If a class changes, you may have to create a conversion tool to convert old format documents to a new format. Or recognize an older format, deserialize to the old, and then save to the new - keeping your old class format around for a little while.
This will help cover version tolerance which is what I think you're after: MSDN - Version Tolerant Serialization

Related

Why SerializationAttribute is not applied by default? [duplicate]

Based on my understanding, SerializableAttribute provides no compile time checks, as it's all done at runtime. If that's the case, then why is it required for classes to be marked as serializable?
Couldn't the serializer just try to serialize an object and then fail? Isn't that what it does right now? When something is marked, it tries and fails. Wouldn't it be better if you had to mark things as unserializable rather than serializable? That way you wouldn't have the problem of libraries not marking things as serializable?

As I understand it, the idea behind the SerializableAttribute is to create an opt-in system for binary serialization.
Keep in mind that, unlike XML serialization, which uses public properties, binary serialization grabs all the private fields by default.
Not only could this include operating system structures and private data that is not supposed to be exposed, but deserializing it could result in corrupt state that can crash an application (silly example: a handle for a file open on a different computer).

This is only a requirement for BinaryFormatter (and the SOAP equivalent, but nobody uses that). Diego is right; there are good reasons for this in terms of what it does, but it is far from the only option - indeed, personally I only recommend BinaryFormatter for talking between AppDomains - it is not (IMO) a good way to persist data (to disk, in cache, to a database BLOB, etc).
If this behaviour causes you trouble, consider using any of the alternatives:
XmlSerializer, which works on public members (not just the fields), but demands a public parameterless constructor and public type
DataContractSerializer, which can work fully opt-in (using [DataContract]/[DataMember]), but which can also (in 3.5 and above) work against the fields instead
Also - for a 3rd-party option (me being the 3rd party); protobuf-net may have options here; "v2" (not fully released yet, but available as source) allows the model (which members to serialize, etc) to be described independently of the type, so that it can be applied to types that you don't control. And unlike BinaryFormatter the output is version-tolerant, known public format, etc.

serialize and deserialize objects from different location

I have two separate programs that need to share information. This sharing will be done by one app placing an XML serialized object in a database, and the other app retrieving it on a different machine. The objects share the same variables but the properties and methods are different.
How exact do the classes have to match between the two programs?
Is the match line by line or just variable, property, and method names?
I ended up using the Newtonsoft.Json library instead of xml and used the <JsonObject(MemberSerialization.OptIn)> and JsonProperty() attributes to control what got serialized.

You did not specify which kind of serialization you were after.
The standard .NET binary serializer is not well suited for data exchange between 2 different assemblies. When you go to deserialize, you'll get an error similar to [Culture].[Assembly].[Version].SourceClass cannot be deserialized to [Culture].[Assembly].[Version].DestClass. This will happen even if the classes are identical.
There are several ways around this: A) use the same service DLL on both sides to do the serializing; B) trick it into deserializing by using an override to report a matching Culture-Assembly-Version-Class, but that seems dodgy; or C) use XML serialization, but that makes for very wordy output, which is also human-readable.
For binary serialization, rather than the .NET binary formatter, there is ProtoBuf-NET, which is faster, produces much smaller output and uses nearly identical syntax.
How exact do the classes have to match between the two programs
ProtoBuf uses a numeric index rather than property name, so they shouldn't have to be too similar. Of course there has to be some similarity or the destination may not have a clue what the data represents. The code in the class can be quite different because it stays put.

Serialization stores only the data for an object - member variables, properties, etc. As long as the data types are compatible, it should work. You do not need a line by line match for the functions.
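As a sketch of that point for the XML case, the two classes below live (hypothetically) in two different programs. They share an element layout but not their methods, and one property is even named differently on the consumer side. ProducerOrder, ConsumerOrder and SharedXmlDemo are invented names for illustration.

```csharp
using System.IO;
using System.Xml.Serialization;

// Producer side (first program).
[XmlRoot("Order")]
public class ProducerOrder
{
    public int Id { get; set; }
    public decimal Total { get; set; }
    public string Describe() => "Order " + Id;   // methods are never serialized
}

// Consumer side (second program): different members, same XML shape.
[XmlRoot("Order")]
public class ConsumerOrder
{
    public int Id { get; set; }

    [XmlElement("Total")]                        // maps the shared element to a local name
    public decimal Amount { get; set; }

    public bool IsLarge => Amount > 100m;        // getter only, so XmlSerializer skips it
}

public static class SharedXmlDemo
{
    public static string Produce(ProducerOrder order)
    {
        var serializer = new XmlSerializer(typeof(ProducerOrder));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, order);
            return writer.ToString();
        }
    }

    public static ConsumerOrder Consume(string xml)
    {
        var serializer = new XmlSerializer(typeof(ConsumerOrder));
        using (var reader = new StringReader(xml))
        {
            return (ConsumerOrder)serializer.Deserialize(reader);
        }
    }
}
```

What has to agree is the root element name, the element names and the data types; the class names, method bodies and computed properties can differ freely.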

It all depends on the serializer you are using. Some require a perfect match, others tend to be more loosely coupled to the objects.
How exact do the classes have to match between the two programs?
Well, not at all. But they should be similar in some way because otherwise the serialization doesn't make sense.
Is the match line by line or variables and method names?
As stated above, there must be some overlap. Usually the property names must be the same, but of course you can also provide a custom mapping.
Take a look at the Newtonsoft library; you can use it (for JSON) like this:
JsonConvert.DeserializeObject<IEnumerable<Unit>>(result);
It's independent of the object method that serialized the string.

Should I use a Namespace of an XML file to identify its version

I'm using DataContractSerializer to serialize a class with DataContract and DataMember attributes to an XML file. My class could potentially change later, and thus the format of the serialized files could also change. I'd like to tag the files I'm saving with a version number so I at least know what version each file is from. I'm still deciding how and if I want to add functionality that will migrate files in older formats to later formats. But right now I'd be happy with just identifying a version mismatch.
Is the namespace of the XML file the correct place to store the version of the file? I was thinking of attributing my class with a DataContract attribute as follows.
[DataContract(Name="MyClass",Namespace="http://www.mycompany.com/MyProject/1.0")]
public class MyClass
...
Then later if MyClass changes I would change the namespace...
[DataContract(Name="MyClass",Namespace="http://www.mycompany.com/MyProject/2.0")]
public class MyClass
...
Is this the correct usage of XML namespaces, or is there another, more preferred way to save the version of an XML file?

You can do it this way, but then the XML representation of your data becomes completely different from version to version from the XML Infoset point of view (in which the namespace is part of the qualified name of the element), so you have neither backwards nor forwards compatibility.
Now, one advantage XML has is that it can be easily processed in a forward-compatible way with technologies such as XPath and XSLT - you just pick the elements you can interpret, and leave anything you don't recognize as is. But this requires elements with the same meaning to retain the same name (including namespace) between versions.
In general, it is best to make your schemas forward-compatible. If you can't achieve that, you might still want to provide as much compatibility as possible with existing tools (it is often easier to achieve compatibility against tools which only read data, rather than with those which also write it). Consequently, you avoid storing version number in such cases, and just try to parse whatever you're given, signalling an error if the input is definitely malformed.
If you come to the point where you absolutely must break compatibility in both directions and start from a clean slate, the suggested way of handling this for WCF data contracts is indeed by changing the namespace, as described in best practices on data contract versioning. There are a few minor variations there as well, such as using publication date instead of version number in the URL (W3C is quite fond of this for their schemas), but these are mostly stylistic.
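If the namespace does end up carrying the version, a mismatch can be detected before handing the file to the serializer. The following is only a sketch: FormatVersionProbe is a hypothetical helper, and the namespace URL is the one from the question.

```csharp
using System.IO;
using System.Xml;

public static class FormatVersionProbe
{
    // Namespace the current build writes (taken from the question's example).
    public const string CurrentNamespace = "http://www.mycompany.com/MyProject/2.0";

    // Returns the namespace URI of the document's root element.
    public static string ReadRootNamespace(string xml)
    {
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();   // skips the XML declaration and comments
            return reader.NamespaceURI;
        }
    }

    public static bool IsCurrentFormat(string xml)
    {
        return ReadRootNamespace(xml) == CurrentNamespace;
    }
}
```

A caller could use IsCurrentFormat to decide whether to deserialize directly, run a migration first, or report a version mismatch to the user.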

Good Way To Handle XML Change

Our system stores XML strings in a database. I've recently had to change the properties on a class, and now when an XML string gets deserialized it throws an exception. What is the best way to handle this change? Look for the node in the application code using XPath or LINQ, or change the XML strings in the SQL database (i.e. do a mass update)?

You might want to look at writing a custom XML deserializer (i.e. implementing IXmlSerializable, see here) to handle changes in your XML. If you've invested a lot of time into crafting your XML serialization attributes, you may want to look at another approach.
Consider batch-upgrading your XML, or deprecating (instead of removing) properties inside of your classes and mapping older behavior to newer behavior.
Longer term, you will want to come up with a strategy for dealing with this in the future, since you will most likely continue to make changes to your schema/object definitions as you add/change the functionality of your system.

If you serialize the objects to the database, you could try the approach I outlined here: load the old versions into a new version, so that when you save, the new version is what gets stored. I'm not sure whether having different versions of your class will be appropriate, though...
Basically, you create a factory to produce your objects from the XML. Every time you change your object, you create a new factory and a new object class; the new class is given an instance of the old class in its constructor and creates itself from it. The new factory tries to create a new object from the XML. If it can, happy days; if it can't, it asks the next-oldest factory to create a next-oldest object from the XML and then builds the new object from that. These factories can be chained together so that you can always load the newest version of the objects from whatever data is in the db.
This assumes that it's always possible to create a valid v2 object from a v1 object.
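The chain described above could be sketched roughly as follows, with two versions only. PersonV1, PersonV2, the factory classes and the element names are all hypothetical, and the XML is parsed by hand with LINQ to XML to keep the example short.

```csharp
using System;
using System.Xml.Linq;

// Oldest format: a single FullName element.
public class PersonV1
{
    public string FullName = "";
}

// Current format: the name is split in two. It can build itself from a V1 object.
public class PersonV2
{
    public string FirstName = "";
    public string LastName = "";

    public PersonV2() { }

    public PersonV2(PersonV1 old)
    {
        string[] parts = old.FullName.Split(new[] { ' ' }, 2);
        FirstName = parts[0];
        LastName = parts.Length > 1 ? parts[1] : "";
    }
}

public static class PersonV1Factory
{
    public static PersonV1 Create(XElement root)
    {
        XElement fullName = root.Element("FullName");
        if (fullName == null) throw new FormatException("Not a known Person format.");
        return new PersonV1 { FullName = fullName.Value };
    }
}

public static class PersonV2Factory
{
    // Tries the current format first, then falls back to the next-oldest factory.
    public static PersonV2 Create(string xml)
    {
        XElement root = XElement.Parse(xml);
        XElement first = root.Element("FirstName");
        XElement last = root.Element("LastName");
        if (first != null && last != null)
        {
            return new PersonV2 { FirstName = first.Value, LastName = last.Value };
        }
        return new PersonV2(PersonV1Factory.Create(root));
    }
}
```

A later V3 would add a PersonV3Factory that falls back to PersonV2Factory in the same way, so callers only ever talk to the newest factory.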

It's good practice to store a version alongside your XML strings, either at the database level or at the class level, so that your code knows which version of the class it has to deserialize.
You might also look at XSLT. It allows you to transform one version of XML into another.
In that case the logic to go from one version to another is not handled by code but by the XSLT. You can even store the XSLT in the database, which makes it reusable by other programs.
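As a sketch of the XSLT route, the helper below applies a stylesheet to an old XML string using XslCompiledTransform. The XmlUpgrader class and the sample stylesheet (which copies everything and renames a Name element to Title) are hypothetical; a real stylesheet would encode whatever changed between your versions.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XmlUpgrader
{
    // Hypothetical v1 -> v2 stylesheet: identity copy, except <Name> becomes <Title>.
    public const string V1ToV2 = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output omit-xml-declaration='yes'/>
  <xsl:template match='@*|node()'>
    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
  </xsl:template>
  <xsl:template match='Name'>
    <Title><xsl:apply-templates select='@*|node()'/></Title>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the given stylesheet to the old XML and returns the upgraded XML.
    public static string Upgrade(string oldXml, string stylesheet)
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(stylesheet)))
        {
            transform.Load(xsltReader);
        }

        using (var input = XmlReader.Create(new StringReader(oldXml)))
        using (var output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Because the stylesheet is just a string, it could be loaded from the database at run time instead of being compiled into the program.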

What are the differences between the XmlSerializer and BinaryFormatter

I spent a good portion of time last week working on serialization. During that time I found many examples utilizing either the BinaryFormatter or XmlSerializer. Unfortunately, what I did not find were any examples comprehensively detailing the differences between the two.
The genesis of my curiosity lies in why the BinaryFormatter is able to deserialize directly to an interface whilst the XmlSerializer is not. Jon Skeet in an answer to "casting to multiple (unknown types) at runtime" provides an example of direct binary serialization to an interface. Stan R. provided me with the means of accomplishing my goal using the XmlSerializer in his answer to "XML Object Deserialization to Interface."
Beyond the obvious fact that the BinaryFormatter uses binary serialization whilst the XmlSerializer uses XML, I'd like to more fully understand the fundamental differences: when to use one or the other, and the pros and cons of each.

The reason a binary formatter is able to deserialize directly to an interface type is because when an object is originally serialized to a binary stream metadata containing type and assembly information is stuck in with the object data. This means that when the binary formatter deserializes the object it knows its type, builds the correct object and you can then cast that to an interface type that object implements.
The XML serializer, on the other hand, just serializes to a schema; it only serializes the public fields and values of the object and no type information other than that (e.g. the interfaces the type implements).
Here is a good post, .NET Serialization, comparing the BinaryFormatter, SoapFormatter, and XmlSerializer. I recommend you look at the following table which in addition to the previously mentioned serializers includes the DataContractSerializer, NetDataContractSerializer and protobuf-net.

Just to weigh in...
The obvious difference between the two is "binary vs xml", but it does go a lot deeper than that:
fields (BinaryFormatter=bf) vs public members (typically properties) (XmlSerializer=xs)
type-metadata based (bf) vs contract-based (xs)
version-brittle (bf) vs version-tolerant (xs)
"graph" (bf) vs "tree" (xs)
.NET specific (bf) vs portable (xs)
opaque (bf) vs human-readable (xs)
As a discussion of why BinaryFormatter can be brittle, see here.
It is impossible to say in general which is bigger; all the type metadata in BinaryFormatter can make it bigger, and XmlSerializer output can work very well with compression like gzip.
However, it is possible to take the strengths of each; for example, Google have open-sourced their own data serialization format, "protocol buffers". This is:
contract-based
portable (see list of implementations)
version-tolerant
tree-based
opaque (although there are tools to show data when combined with a .proto)
typically "contract first", but some implementations allow implicit contracts based on reflection
But importantly, it is very dense data (no type metadata, pure binary representation, short tags, tricks like variable-length base-128 "varint" encoding, which uses 7 payload bits per byte), and very efficient to process (no complex xml structure, no strings to match to members, etc).
I might be a little biased; I maintain one of the implementations (including several suitable for C#/.NET), but you'll note I haven't linked to any specific implementation; the format stands under its own merits ;-p
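To give a feel for why the variable-length integer trick mentioned above keeps the data dense, here is a minimal sketch of that encoding: each byte carries 7 bits of the value, and the high bit says whether more bytes follow. The Varint class is a hypothetical helper written for illustration, not part of any protocol buffers library.

```csharp
using System;
using System.Collections.Generic;

public static class Varint
{
    // Encodes an unsigned integer, least significant 7-bit group first.
    public static byte[] Encode(ulong value)
    {
        var bytes = new List<byte>();
        while (value >= 0x80)
        {
            bytes.Add((byte)((value & 0x7F) | 0x80)); // 7 payload bits + "more follows" flag
            value >>= 7;
        }
        bytes.Add((byte)value);                        // last byte has the flag cleared
        return bytes.ToArray();
    }

    // Decodes a single varint from the start of the buffer.
    public static ulong Decode(byte[] bytes)
    {
        ulong result = 0;
        int shift = 0;
        foreach (byte b in bytes)
        {
            result |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
        throw new FormatException("Truncated varint.");
    }
}
```

Small numbers, which dominate most documents, take a single byte, while a value such as 300 takes two (0xAC 0x02) instead of the four or eight bytes of a fixed-width integer.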

The XML Serializer produces XML and also an XML Schema (implicitly). It will produce XML that conforms to this schema.
One implication is that it will not serialize anything which cannot be described in XML Schema. For instance, there is no way to distinguish between a list and an array in XML Schema, so the XML Schema produced by the serializer can be interpreted either way.
Runtime serialization (which the BinaryFormatter is part of) serializes the actual .NET types to the other side, so if you send a List<int>, the other side will get a List<int>.
That obviously works better if the other side is running .NET.

The XmlSerializer serialises the type by reading all the type's properties that have both a public getter and a public setter (and also any public fields). In this sense the XmlSerializer serializes/deserializes the "public view" of the instance.
The binary formatter, by contrast, serializes a type by serializing the instance's "internals", i.e. its fields. Any fields that are not marked as [NonSerialized] will be serialized to the binary stream. The type itself must be marked as [Serializable] as must any internal fields that are also to be serialized.
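The "public view" behaviour of XmlSerializer can be seen in a small sketch like the one below. Account and PublicViewDemo are hypothetical names; the class has one read/write property, one getter-only property and one private field.

```csharp
using System.IO;
using System.Xml.Serialization;

public class Account
{
    private int accessCount;                   // private field: invisible to XmlSerializer

    public string Owner { get; set; } = "";    // public getter and setter: serialized
    public int AccessCount => accessCount;     // getter only: skipped

    public void Touch() { accessCount++; }
}

public static class PublicViewDemo
{
    public static string ToXml(Account account)
    {
        var serializer = new XmlSerializer(typeof(Account));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, account);
            return writer.ToString();
        }
    }
}
```

The resulting XML should contain an Owner element only; the counter is lost on a round trip, whereas a field-based serializer would have captured it.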

I guess one of the most important ones is that binary serialization can serialize both public and private members, whereas the other one works only with public ones.
The post linked below provides a very helpful comparison between the two in terms of size. That is an important issue, because you might send your serialized object to a remote machine.
http://www.nablasoft.com/alkampfer/index.php/2008/10/31/binary-versus-xml-serialization-size/
