Situational Background: XSD with SCH
XML Schema (XSD)
I have an XML schema definition ("the schema") that includes several other XSDs, all in the same namespace. Some of those import other XSDs from foreign namespaces. All in all, the schema declares several global elements that can be instantiated as XML documents. Let's call them Global_1, Global_2 and Global_3.
Business Rules (SCH)
The schema is augmented by a Schematron file that defines the "business rules". It defines a number of abstract rules, and each abstract rule contains a number of assertions using the data model defined via XSD. For instance:
<sch:pattern>
<sch:rule id="rule_A" abstract="true">
<sch:assert test="if (abc:a/abc:b = '123') then abc:x/abc:y = ('aaa', 'bbb', 'ccc') else true()" id="A-01">Error message</sch:assert>
<sch:assert test="not(abc:c = 'abcd' and abc:d = 'zz')" id="A-02">Some other error message</sch:assert>
</sch:rule>
<!-- (...) -->
</sch:pattern>
Each abstract rule is extended by one or more non-abstract (concrete) rules, each of which defines a specific context in which the abstract rule's assertions are to be validated. For example:
<sch:pattern>
<!-- (...) -->
<sch:rule context="abc:Global_1/abc:x/abc:y">
<sch:extends rule="rule_A"/>
</sch:rule>
<sch:rule context="abc:Global_2/abc:j//abc:k/abc:l">
<sch:extends rule="rule_A"/>
</sch:rule>
<!-- (...) -->
</sch:pattern>
In other words, all the assertions defined within the abstract rule_A are being applied to their specific contexts.
Both "the schema" and "the business rules" are subject to change - my program gets them at run-time and I don't know their content at design-time. The only thing I can safely assume is that there are no endless recursive structures in the schema: There is always one definite leaf node for every type and no type contains itself. Put differently, there are no "infinite loops" possible in the instances.
The Problem I want To Solve
Basically, I want to evaluate programmatically if each of the defined rules is correct. Since correctness can be quite a problematic topic, here by correctness I simply mean: Each XPath used in a rule (i.e. its context and within the XQueries of its inherited assertions) is "possible", meaning it can exist according to the data model defined in the schema. If, for instance, a namespace prefix is forgotten (abc:a/b instead of abc:a/abc:b), this XPath will never return anything other than an empty node set. The same is true if one step in the XPath is accidentally omitted, or spelled wrong, etc. This is obviously not a very strong claim for "correctness" of such a rule, but it'll do for a first step.
My Approach Towards A Solution For This
At least to me it doesn't seem like a trivial problem to evaluate an XPath (not to speak of the entire XQuery!) designed for the instance of a schema against the actual schema, given how it may contain axis steps like //, ancestor::, following-sibling::, etc. So I decided to construct something I would call a "maximum instance": By recursively iterating through all global elements and their children (and the structure of their respective complex types etc.), I build an XML instance at run-time that contains every possible element and attribute where it would be in a normal instance, but all at once. That includes every optional element/attribute, every element within a choice block and so on. Said maximum instance would look something like this:
<maximumInstance>
<Global_1>
<abc:a>
<abc:b additionalAttribute="some_fixed_value">
<abc:j/>
<abc:k/>
<abc:l/>
</abc:b>
</abc:a>
</Global_1>
<Global_2>
<abc:x>
<abc:y>
<abc:a/>
<abc:z>
<abc:l/>
</abc:z>
</abc:y>
</abc:x>
</Global_2>
<Global_3>
<!-- ... -->
</Global_3>
<!-- ... -->
</maximumInstance>
All it takes now is to iterate over all abstract rules: for every assertion in an abstract rule, and for every context in which that abstract rule is extended, every XPath within the assertion must yield a non-empty node set when evaluated against the maximum instance.
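The check described above could be sketched roughly as follows. This is only an illustration under assumptions: the names PathCheck and IsPossible are made up, the Schematron context is approximated as "any element matching the pattern" by prefixing it with //, and the built-in System.Xml.XPath engine used here only understands XPath 1.0.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

public static class PathCheck
{
    // Returns true if 'path', evaluated relative to at least one node matching
    // 'context' in the maximum instance, yields a non-empty node set.
    public static bool IsPossible(XDocument maximumInstance, string context,
                                  string path, IXmlNamespaceResolver ns)
    {
        // Assumption: the context is a simple path pattern (no unions etc.).
        IEnumerable<XElement> contextNodes =
            maximumInstance.XPathSelectElements("//" + context, ns);

        foreach (XElement node in contextNodes)
        {
            // XPathEvaluate returns a sequence only for node-set results;
            // booleans, numbers and strings fail the cast and are skipped.
            var result = node.XPathEvaluate(path, ns) as IEnumerable<object>;
            if (result != null && result.Any())
            {
                return true;
            }
        }
        return false;
    }
}
```

The resolver would typically be an XmlNamespaceManager on which the abc prefix has been registered with the schema's target namespace. With such a setup, a path with a forgotten prefix (abc:a/b) comes back as not possible, which is exactly the kind of mistake the check is meant to catch.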
Where I'm stuck
I have written a C# (.NET Framework 4.8) program that parses "the schema" into said "maximum instance" (which is an XDocument at run-time). It also parses the business rules into a structure that makes it easy to get each abstract rule, its assertions, and the contexts these assertions are to be validated against.
But currently, I only have each complete XQuery (just like they are in the Schematron file) which effectively creates an assertion. But I actually need to break the XQuery down into its components (I guess I'd need the abstract syntax tree) so that I would have all individual XPaths. For instance, when given the XQuery if (abc:a/abc:b = '123') then abc:x/abc:y = ('aaa', 'bbb', 'ccc') else true(), I would need to retrieve abc:a/abc:b and abc:x/abc:y.
I assume that this could be done using Saxon-HE (or maybe another Parser/Compiler currently available for C# I don't know about). Unfortunately, I have yet to understand how to make use of Saxon well enough to even find at least a valid starting point for what I want to achieve. I've been trying to use the abstract syntax tree (so I can access the respective XPaths in the XQuery) seemingly accessible via XQueryExecutable:
Processor processor = new Processor();
XQueryCompiler xqueryCompiler = processor.NewXQueryCompiler();
XQueryExecutable exe = xqueryCompiler.Compile(xquery);
var AST = exe.getUnderlyingCompiledQuery();
var st = new XDocument();
st.Add(new XElement("root"));
XdmNode node = processor.NewDocumentBuilder().Build(st.CreateReader());
AST.explain(node); // <-- this is an error!
But that doesn't get me anywhere: I can't find any exposed properties I could work with. And while VS offers me AST.explain(...) (which seems promising), I'm unable to figure out what to pass as its parameter. I tried using an XdmNode, which I thought would be a Destination. But I am using Saxon 10 (via NuGet), while Destination seems to be from Saxon 9: net.sf.saxon.s9api.Destination?!
Does anybody who was kind enough to read through all of this have any advice for me on how to tackle this? :-) Or, maybe there's a better way to solve my problem I haven't thought of - I'm also grateful for suggestions.
TL;DR
Sorry for the wall of text! In short: I have Schematron rules that augment an XML schema with business logic. To evaluate these rules (not: validate instances against the rules!) without actual XML instances, I need to break down the XQueries which make up the Schematron's assertions into their components so that I can handle all XPaths used in them. I think it can be done with Saxon-HE, but my knowledge is too limited to even understand what a good starting point would be for that. I'm also open to suggestions regarding a possibly better approach to solve my actual problem (as described in detail above).
Thank you for taking the time to read this.
If this were an XSD schema rather than a Schematron schema, then Saxon-EE would do the job for you automatically: this is very similar to what a schema-aware XQuery processor attempts to do. But another difference is that in schema-aware XQuery, you can't assume that every element named foo is a valid instance of the element declaration named foo in the schema; it's quite legitimate, for example, for a query to transform valid instances into invalid instances, or vice versa. The input and output, after all, might conform to different schemas.
Saxon uses path analysis to do this: it looks at path expressions to see "where they might lead". Path analysis is also used to assess streamability, and to support document projection (building a trimmed-down tree representation of the source document that leaves out the parts that the query cannot reach). The path analysis in Saxon is by no means complete, for example it doesn't attempt to handle recursive functions. Although all these operations require Saxon-EE, the basic path analysis code is actually present in Saxon-HE, but I would offer no guarantee that it works for any purpose other than those described.
You're basically right that this is a tough problem you've set yourself, and I wish you luck with it.
Another approach you could adopt that wouldn't involve grovelling around the Saxon internals is to convert the XQuery to XQueryX, which is an XML representation of the parse tree, and then inspect the XQueryX (presumably using XQuery) to find the parts you need.
While XQueryX (as pointed out by Michael Kay) would theoretically have been exactly what I was looking for, unfortunately I could not find anything useful regarding an implementation for .NET during my research.
So I eventually solved the whole thing by creating my own parser, using the XPath 3.1 grammar for ANTLR4 as a starting point. This way, I am now able to retrieve a syntax tree of any Schematron rule expression, allowing me to extract each contained XPath expression (and its sub-expressions) separately.
Note that another stumbling block has been the fact that .NET still (!) natively handles only XPath 1.0: while my parser does everything it is supposed to, .NET gave me "illegal token" errors when I tried to evaluate some of the extracted expressions. Installing the XPath2 NuGet package by Chertkov/Heyenrath was the solution.
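To illustrate the XPath 1.0 limitation, here is a small sketch that asks the built-in engine whether it can compile an expression at all. The class and method names (XPathVersionProbe, CompilesAsXPath1) are made up for this example; XPathExpression.Compile is the standard System.Xml.XPath call.

```csharp
using System.Xml.XPath;

public static class XPathVersionProbe
{
    // Returns true if the built-in (XPath 1.0) engine accepts the expression.
    // Expressions using XPath 2.0+ syntax such as if/then/else or sequence
    // literals are rejected with an XPathException.
    public static bool CompilesAsXPath1(string expression)
    {
        try
        {
            XPathExpression.Compile(expression);
            return true;
        }
        catch (XPathException)
        {
            return false;
        }
    }
}
```

A probe like this could be used to decide per expression whether the built-in engine is sufficient or an XPath 2.0 library has to be used instead.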
I'm trying to deserialize an unstructured JSON object (there are multiple schema possibilities) to a BsonDocument, and I'm trying to specify the correct types for some properties (which I know in advance).
Say for example (very simplified example):
{ "Id": "039665be-a1a8-4062-97d6-e44fea2affff", "Foo":"Bar", "Baz":30 }
So I know every time I find an "Id" property (which may or may not be there) on the root of the object, this is to be converted to a UUID type (BSON binary subtype 4).
I've made a simple JsonReader descendant, and I'm overriding ReadBsonType, and both returning a new CurrentBsonType and providing a converted value there, then overriding all methods for every possible type (ReadDateTime(), ReadInt32(), ReadInt64(), ReadBinaryData(), etc.) and providing a parsed value.
This works fine (albeit I find it a bit uncomfortable) when the JSON object is flat, but if it has nested objects with properties of the same name (which I do not want to parse), then problems arise.
I've tried overriding ReadStartArray(), ReadStartDocument(), etc., and tried building a "path" which I can query to, but the actual order of the calling of the methods of the JsonReader are baffling to me (it seems to check for the type before checking for the name of the property, so the CurrentName property when checking for the type refers to the previous property, etc.).
I've somewhat circumvented it with very ugly code... and I'm sure there must be a better way to do this without class mapping, although finding documentation is proving hard (since mongo often calls "json" the actual "json extended", so documentation gets mixed from here and there).
Has anyone ever found themselves in such a situation?
PS: before anyone asks, I'm storing data returned as json strings from a third party server on a mongo database (that will be mined later on), and while there are some schemas for the datatypes available (and I could classmap them), there might be new schemas in the future so I can't just classmap everything. Some properties (if they exist) are always the same though, so instead of storing everything on mongo as a string I'd rather give them the correct possible types from a start.
I have been working on a Windows Form Control project to import into a 3rd party client software using their supplied SDK. The custom control written by yet another company I am trying to load requires sign on to a server before displaying information, which can take 20-30 seconds. In order to speed things up I had the idea to pre-load information needed by the control to a text file. Since it is not a known type it is throwing errors when trying to serialize the class.
I have a Dictionary I am using to reference back to the proper ICamera class. If I change the dictionary value from an ICamera to a string, for example by storing cam.GetLiveURL() instead of cam, it writes the text file without issue. This is the code I am using to populate the Dictionary.
foreach (ICamera cam in _adapter.Cameras())
{
OCCamera.Add(cam.GetDisplayName(), cam);
}
I have tried XMLSerializer, and it seems it has difficulty dealing with a Dictionary.
I have tried BinaryFormatter and get the error:
Type 'OCAdapter.OCCamera' in Assembly 'OCAdapter.dll' is not marked as serializable.
I have tried DataContractSerializer and get the error:
Type 'OCAdapter.OCCamera' with data contract name
'OCCamera:http://schemas.datacontract.org/2004/07/OCAdapter' is not
expected. Consider using a DataContractResolver or add any types not
known statically to the list of known types - for example, by using
the KnownTypeAttribute attribute or by adding them to the list of known
types passed to DataContractSerializer.
I have tried playing around with the DataContractResolver and cannot seem to get it to work; I do not understand it at all.
The code I am using for the BinaryFormatter and DataContractSerializer is straight from MSDN or elsewhere, and tests fine without the custom type.
Maybe there is a better way to handle all this, and I am missing it. I am not opposed to ditching the Dictionary route for something else, or I can rewrite any amount of other code to make this work.
Mistake 1: trying to serialize your implementation rather than the data.
Mistake 2: using BinaryFormatter... just about ever (except maybe AppDomain marshalling)
My advice: create a simple model ("DTO" model) that just represents the data you need, but not in terms of your specific implementation (no OCAdapter.OCCamera etc). You can construct this DTO model in whatever way is convenient for whatever serialization library you like. I'm partial to protobuf-net, but many others exist. Then map to/from your DTO model and your implementation model.
Advantages:
it'll work
changes to the implementation don't impact the data; it only impacts the mapping code
you can use just about any serializer you want
you can version the data sensibly
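A minimal sketch of the DTO idea, under stated assumptions: the ICamera interface below is a stand-in that declares only the two methods mentioned in the question, CameraDto and CameraCache are hypothetical names, and XmlSerializer is used purely because it ships with .NET (any serializer would do).

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Serialization;

// Stand-in for the SDK interface (assumption: only these two methods are needed).
public interface ICamera
{
    string GetDisplayName();
    string GetLiveURL();
}

// Plain data holder with no reference to the SDK's implementation types.
public class CameraDto
{
    public string DisplayName { get; set; }
    public string LiveUrl { get; set; }
}

public static class CameraCache
{
    // Map implementation objects to DTOs.
    public static List<CameraDto> ToDtos(IEnumerable<ICamera> cameras)
    {
        return cameras
            .Select(c => new CameraDto { DisplayName = c.GetDisplayName(), LiveUrl = c.GetLiveURL() })
            .ToList();
    }

    // Serialize the DTO list to an XML string.
    public static string Save(List<CameraDto> dtos)
    {
        var serializer = new XmlSerializer(typeof(List<CameraDto>));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, dtos);
            return writer.ToString();
        }
    }

    // Read the DTO list back from an XML string.
    public static List<CameraDto> Load(string xml)
    {
        var serializer = new XmlSerializer(typeof(List<CameraDto>));
        using (var reader = new StringReader(xml))
        {
            return (List<CameraDto>)serializer.Deserialize(reader);
        }
    }
}
```

A list of DTOs is used instead of a Dictionary because XmlSerializer has trouble with dictionaries, as noted in the question; the lookup by display name can be rebuilt after loading.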
Let's say my C# model has been updated while the corresponding collection still contains old documents. I want old and new documents to coexist in the collection, while using only the new version of the C# model to read them. I would prefer not to use inheritance if possible. So I wonder which of these issues are solvable and how:
There is a new property in the C# model which is not present in the database. I think this should never be an issue: Mongo knows nothing about it, and it will be initialized with the default value. The only open point is how to initialize it with a particular value for all old documents. Does anybody know how?
A property has been removed from the model. I want MongoDB to notice that there is no longer a property in the C# class to map the old document's field to, and to ignore it instead of crashing. This scenario probably sounds a bit strange, as it means some garbage is left in the database, but is this behavior possible to implement/configure?
The type of a property has changed, and the two types are convertible, like integer->string. Is there any way to configure the mapping for old docs?
I can consider using inheritance for the second case if it is not solvable otherwise.
Most of the answers to your questions are found here.
BsonDefaultValue("abc") attribute on properties to handle values not present in the database, and to give them a default value upon deserialization
BsonIgnoreExtraElements attribute on the class to ignore extra elements found during deserialization (to avoid the exception)
A custom serializer is required to handle if the type of a member is changed, or you need to write an upgrade script to fix the data. It would probably be easier to leave the int on load, and save to a string as needed. (That will mean that you'll need a new property name for the string version of the property.)
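The two attributes mentioned above might be applied like this. It is a sketch that assumes the MongoDB.Bson NuGet package is referenced; the Person class, its members and the PersonLoader helper are invented for illustration.

```csharp
using MongoDB.Bson;
using MongoDB.Bson.Serialization;
using MongoDB.Bson.Serialization.Attributes;

// Fields in stored documents that no longer have a matching member are skipped
// instead of causing an exception.
[BsonIgnoreExtraElements]
public class Person
{
    public string Name { get; set; }

    // Old documents lacking this field deserialize with "abc" here.
    [BsonDefaultValue("abc")]
    public string Nickname { get; set; }
}

public static class PersonLoader
{
    // Deserialize a document given as JSON text into the current model.
    public static Person FromJson(string json)
    {
        return BsonSerializer.Deserialize<Person>(BsonDocument.Parse(json));
    }
}
```

Feeding this an "old" document that has a removed field and lacks Nickname should yield a Person with the default nickname and no exception.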
I have a class that serializes a set of objects (using XML serialization) that I want to unit test.
My problem is it feels like I will be testing the .NET implementation of XML serialization, instead of anything useful. I also have a slight chicken and egg scenario where in order to test the Reader, I will need a file produced by the Writer to do so.
I think the questions (there's 3 but they all relate) I'm ultimately looking for feedback on are:
Is it possible to test the Writer, without using the Reader?
What is the best strategy for testing the reader (XML file? Mocking with record/playback)? Is it the case that all you will really be doing is testing property values of the objects that have been deserialized?
What is the best strategy for testing the writer?
Background info on Xml serialization
I'm not using a schema, so all XML elements and attributes match the objects' properties. As there is no schema, tags/attributes which do not match those found in properties of each object, are simply ignored by the XmlSerializer (so the property's value is null or default). Here is an example
<MyObject Height="300">
<Name>Bob</Name>
<Age>20</Age>
</MyObject>
would map to
public class MyObject
{
public string Name { get;set; }
public int Age { get;set; }
[XmlAttribute]
public int Height { get;set; }
}
and vice versa. If the object changed to the below, the XML would still deserialize successfully, but FirstName would be blank.
public class MyObject
{
public string FirstName { get;set; }
public int Age { get;set; }
[XmlAttribute]
public int Height { get;set; }
}
An invalid XML file would deserialize correctly, therefore the unit test would pass unless you ran assertions on the values of the MyObject.
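The silent mismatch described above can be reproduced with a few lines. In this sketch the renamed class is called RenamedMyObject (a made-up name, mapped to the same root element via XmlRoot) so that it can sit next to the original MyObject; RenameDemo is likewise hypothetical.

```csharp
using System.IO;
using System.Xml.Serialization;

// Same shape as the changed MyObject above, with Name renamed to FirstName.
[XmlRoot("MyObject")]
public class RenamedMyObject
{
    public string FirstName { get; set; }
    public int Age { get; set; }
    [XmlAttribute]
    public int Height { get; set; }
}

public static class RenameDemo
{
    // XML written by the old version of the class.
    public const string OldXml =
        "<MyObject Height=\"300\"><Name>Bob</Name><Age>20</Age></MyObject>";

    // Deserializes the old XML with the new class; no exception is thrown.
    public static RenamedMyObject Load()
    {
        var serializer = new XmlSerializer(typeof(RenamedMyObject));
        using (var reader = new StringReader(OldXml))
        {
            return (RenamedMyObject)serializer.Deserialize(reader);
        }
    }
}
```

Load() returns normally, so a test that only checks "deserialization did not throw" passes; only an assertion on FirstName reveals that the name was lost.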
Do you need to be able to do backward compatibility? If so, it may be worth building up unit tests of files produced by old versions which should still be able to be deserialized by new versions.
Other than that, if you ever introduce anything "interesting" it may be worth a unit test to just check you can serialize and deserialize just to make sure you're not doing something funky with a readonly property etc.
I would argue that it is essential to unit test serialization if it is vitally important that you can read data between versions. And you must test with "known good" data (i.e. it isn't sufficient to simply write data in the current version and then read it again).
You mention that you don't have a schema... why not generate one? Either by hand (it isn't very hard), or with xsd.exe. Then you have something to use as a template, and you can verify this just using XmlReader. I'm doing a lot of work with xml serialization at the moment, and it is a lot easier to update the schema than it is to worry about whether I'm getting the data right.
Even XmlSerializer can get complex; particularly if you involve subclasses ([XmlInclude]), custom serialization (IXmlSerializable), or non-default XmlSerializer construction (passing additional metadata at runtime to the ctor). Another possibility is creative use of [XmlIgnore], [XmlAnyAttribute] or [XmlAnyElement]; for example you might support unexpected data for round-trip (only) in version X, but store it in a known property in version Y.
With serialization in general:
The reason is simple: you can break the data! How badly you do this depends on the serializer; for example, with BinaryFormatter (and I know the question is XmlSerializer), simply changing from:
public string Name {get;set;}
to
private string name;
public string Name {
get {return name;}
set {name = value; OnPropertyChanged("Name"); }
}
could be enough to break serialization, as the field name has changed (and BinaryFormatter loves fields).
There are other occasions when you might accidentally rename the data (even in contract-based serializers such as XmlSerializer / DataContractSerializer). In such cases you can usually override the wire identifiers (for example [XmlAttribute("name")] etc), but it is important to check this!
Ultimately, it comes down to: is it important that you can read old data? It usually is; so don't just ship it... prove that you can.
For me, this is absolutely in the Don't Bother category. I don't unit test my tools. However, if you wrote your own serialization class, then by all means unit test it.
If you want to ensure that the serialization of your objects doesn't break, then by all means unit test. If you read the MSDN docs for the XMLSerializer class:
The XmlSerializer cannot serialize or deserialize the following: arrays of ArrayList; arrays of List<T>.
There is also a peculiar issue with enums declared as unsigned longs. Additionally, any objects marked as [Obsolete] do not get serialized from .NET 3.5 onwards.
If you have a set of objects that are being serialized, testing the serialization may seem odd, but it only takes someone to edit the objects being serialized to include one of the unsupported conditions for the serialisation to break.
In effect, you are not unit testing XML serialization, you are testing that your objects can be serialized. The same applies for deserialization.
Yes, as long as what needs to be tested is properly tested, through a bit of intervention.
The fact that you're serializing and deserializing in the first place means that you're probably exchanging data with the "outside world" -- the world outside the .NET serialization domain. Therefore, your tests should have an aspect that's outside this domain. It is not OK to test the Writer using the Reader, and vice versa.
It's not only about whether you would just end up testing the .NET serialization/deserialization; you have to test your interface with the outside world -- that you can output XML in the expected format and that you can properly consume XML in the anticipated format.
You should have static XML data that can be used to compare against serialization output and to use as input data for deserialization.
Assume you give the job of note taking and reading the notes back to the same guy:
You - Bob, I want you to jot down the following: "small yellow duck."
Bob - OK, got it.
You - Now, read it back to me.
Bob - "small yellow duck"
Now, what have we tested here? Can Bob really write? Did Bob even write anything or did he memorize the words? Can Bob actually read? -- his own handwriting? What about another person's handwriting? We don't have answers to any of these questions.
Now let's introduce Alice to the picture:
You - Bob, I want you to jot down the following: "small yellow duck."
Bob - OK, got it.
You - Alice, can you please check what Bob wrote?
Alice - OK, he's got it.
You - Alice, can you please jot down a few words?
Alice - Done.
You - Bob, can you please read them?
Bob - "red fox"
Alice - Yup, that sounds right.
We now know, with certainty, that Bob can write and read properly -- as long as we can completely trust Alice. Static XML data (ideally tested against a schema) should be sufficiently trustworthy.
In my experience it is definitely worth doing, especially if the XML is going to be used as an XML document by the consumer. For example, the consumer may need to have every element present in the document, either to avoid null checking of nodes when traversing or to pass schema validation.
By default the XML serializer will omit properties with a null value unless you add the [XmlElement(IsNullable = true)] attribute. Similarly, you may have to redirect generic list properties to standard arrays with an [XmlArray] attribute.
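The difference the IsNullable setting makes can be seen in a short sketch. Contact and NullableDemo are made-up names for this illustration.

```csharp
using System.IO;
using System.Xml.Serialization;

public class Contact
{
    // Emitted as an empty element carrying xsi:nil="true" when null.
    [XmlElement(IsNullable = true)]
    public string Name { get; set; }

    // Left out of the output entirely when null (the default behaviour).
    public string Nickname { get; set; }
}

public static class NullableDemo
{
    // Serializes a contact to an XML string.
    public static string Serialize(Contact contact)
    {
        var serializer = new XmlSerializer(typeof(Contact));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, contact);
            return writer.ToString();
        }
    }
}
```

Serializing a Contact with both properties left null produces a Name element marked as nil and no Nickname element at all, which is the kind of detail a consumer doing schema validation may care about.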
As another contributor said, if the object is changing over time, you need to continuously check that the output is consistent. It will also protect you against the serializer itself changing and not being backwards compatible, although you'd hope that this doesn't happen.
So for anything other than trivial uses, or where the above considerations are irrelevant, it is worth the effort of unit testing it.
There are a lot of types that serialization can not cope with etc. Also if you have your attributes wrong, it is common to get an exception when trying to read the xml back.
I tend to create an example tree of the objects that can be serialized with at least one example of each class (and subclass). Then at a minimum serialize the object tree to a stringstream and then read it back from the stringstream.
You will be amazed at the number of times this catches a problem and saves me having to wait for the application to start up to find the problem. This level of unit testing is more about speeding up development rather than increasing quality, so I would not do it for working serialization.
As other people have said, if you need to be able to read back data saved by old versions of your software, you had better keep a set of example data files for each shipped version and have tests to confirm you can still read them. This is harder than it seems at first, as the meaning of fields on an object may change between versions, so just being able to create the current object from an old serialized file is not enough; you have to check that the meaning is the same as it was in the version of the software that saved the file. (Put a version attribute in your root object now!)
I agree with you that you will be testing the .NET implementation more than you'll be testing your own code. But if that's what you want to do (perhaps you don't trust the .NET implementation :) ), I might approach your three questions as follows.
Yes, it's certainly possible to test the writer without the reader. Use the writer to serialize the example (20-year old Bob) you provided to a MemoryStream. Open the MemoryStream with an XmlDocument. Assert the root node is named "MyObject". Assert it has one attribute named "Height" with value "300". Assert there is a "Name" element containing a text node with value "Bob". Assert there is an "Age" element containing a text node with value "20".
Just do the reverse process of #1. Create an XmlDocument from the 20-year old Bob XML string. Deserialize the stream with the reader. Assert the Name property equals "Bob". Assert the Age property equals 20. You can do things like add test cases with insignificant whitespace or single quotes instead of double quotes to be more thorough.
See #1. You can extend it by adding what you consider to be tricky "edge" cases you think could break it. Names with various Unicode characters. Extra long names. Empty names. Negative ages. Etc.
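The writer-only test from #1 could look roughly like this. It is a sketch: MyObject mirrors the class from the question, WriterCheck and SerializeToDocument are invented names, and the assertions would live in whatever test framework is in use.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public class MyObject
{
    public string Name { get; set; }
    public int Age { get; set; }
    [XmlAttribute]
    public int Height { get; set; }
}

public static class WriterCheck
{
    // Serializes the object with XmlSerializer and re-parses the bytes with
    // XmlDocument, so the output is inspected without going through
    // deserialization.
    public static XmlDocument SerializeToDocument(MyObject value)
    {
        var serializer = new XmlSerializer(typeof(MyObject));
        using (var stream = new MemoryStream())
        {
            serializer.Serialize(stream, value);
            stream.Position = 0; // rewind before reading

            var document = new XmlDocument();
            document.Load(stream);
            return document;
        }
    }
}
```

A test would call SerializeToDocument for 20-year old Bob and then assert on DocumentElement.Name, GetAttribute("Height") and the InnerText of the Name and Age child elements.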
I have done this in some cases... not testing the serialisation as such, but using some 'known good' XML serializations and then loading them into my classes, and checking that all the properties (as applicable) have the expected values.
This is not going to test anything for the first version... but if the classes ever evolve I know I will catch any breaking changes in the format.
We do acceptance testing of our serialization rather than unit testing.
What this means is that our acceptance testers take the XML schema, or as in your case some sample XML, and re-create their own serializable data-transfer class.
We then use NUnit to test our WCF service with this clean-room XML.
With this technique we've identified many, many errors. For example, where we have changed the name of the .NET member and forgotten to add an [XmlElement] attribute that sets the original element name.
If there's nothing you can do to change the way your class serializes, then you're testing .NET's implementation of XML serialization ;-)
If the format of the serialized XML matters, then you need to test the serialization. If it's important that you can deserialize it, then you need to test deserialization.
Seeing how you can't really fix serialization, you shouldn't be testing it - instead, you should be testing your own code and the way it interacts with the serialization mechanism. For example, you might need to unit-test the structure of the data you're serializing to make sure that no-one accidentally changes a field or something.
Speaking of which, I have recently adopted a practice where I check such things at compile-time rather than during execution of unit tests. It's a bit tedious, but I have a component that can traverse the AST, and then I can read it in a T4 template and write lots of #error messages if I meet something that shouldn't be there.