Mongo Schema-less Collections & C#

I'm exploring Mongo as an alternative to relational databases but I'm running into a problem with the concept of schemaless collections.
In theory it sounds great, but as soon as you tie a model to a collection, the model becomes your de facto schema. You can no longer just add or remove fields from the model and expect things to keep working. I see the same change-management problem here as with a relational database, in that you need some sort of script to migrate from one version of the schema to the next.
Am I approaching this from the wrong angle? What approaches do members here take to ensure that their collection items stay in sync with their domain model when making updates to their domain model?
Edit: It's worth noting that these problems obviously exist in relational databases as well, but I'm asking specifically for strategies in mitigating the problem using schemaless databases and more specifically Mongo. Thanks!

Schema migration with MongoDB is actually a lot less painful than with, say, SQL Server.
Adding a new field is easy: old records will come back with it set to null, or you can use attributes to control the default value, e.g. [BsonDefaultValue("abc", SerializeDefaultValue = false)].
The [BsonIgnoreIfNull] attribute is also handy for omitting members that are null from the document when it is serialized.
Removing a field is fairly easy too: you can use [BsonExtraElements] (see docs) to collect the leftover fields and preserve them, or [BsonIgnoreExtraElements] to simply throw them away.
With these in place there is really no need to convert every record to the new schema up front; you can do it lazily as records are updated, or slowly in the background.
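As a rough illustration of the attributes mentioned above (class and field names are made up; the exact attribute set depends on your driver version):

using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

[BsonIgnoreExtraElements]               // fields removed from the model are silently dropped
public class Customer
{
    public ObjectId Id { get; set; }

    public string Name { get; set; }

    // Newly added field: old documents deserialize with this default instead of null.
    [BsonDefaultValue("unknown")]
    public string Region { get; set; }

    // Omitted from the document entirely when null.
    [BsonIgnoreIfNull]
    public string Nickname { get; set; }
}

// Alternatively, collect unknown fields rather than discarding them:
public class AuditedCustomer
{
    public ObjectId Id { get; set; }
    public string Name { get; set; }

    [BsonExtraElements]
    public BsonDocument ExtraElements { get; set; }   // holds fields the class no longer declares
}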
PS, since you are also interested in using dynamic with Mongo, here's an experiment I tried along those lines. And here's an updated post with a complete serializer and deserializer for dynamic objects.

My current thought on this is to use the same sort of implementation I would using a relational database. Have a database version collection which stores the current version of the database.
My repositories would have a minimum required version which they need in order to accurately serialize and deserialize the collection's items. If the current db version is lower than the required version, I just throw an exception. Then I use migrations to do all the conversion necessary to bring the collections up to the required state for deserialization, and update the database version number.
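A minimal sketch of that idea, assuming the 2.x C# driver API and a hypothetical "databaseVersion" collection holding a single document:

using System;
using MongoDB.Bson;
using MongoDB.Driver;

public class DatabaseVersion
{
    public ObjectId Id { get; set; }
    public int Version { get; set; }
}

public abstract class RepositoryBase
{
    protected abstract int MinimumRequiredVersion { get; }

    protected RepositoryBase(IMongoDatabase database)
    {
        var current = database.GetCollection<DatabaseVersion>("databaseVersion")
                              .Find(_ => true)
                              .FirstOrDefault();

        int actual = current == null ? 0 : current.Version;
        if (actual < MinimumRequiredVersion)
            throw new InvalidOperationException(
                "Database schema version " + actual + " is below the required version " +
                MinimumRequiredVersion + "; run migrations first.");
    }
}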

With statically-typed languages like C#, whenever an object gets serialised somewhere, its original class then changes, and it's later deserialised back into the new class, you're probably going to run into problems somewhere along the line. It's fairly unavoidable whether it's MongoDB, WCF, XmlSerializer or whatever.
You've usually got some flexibility with serialization options, for example with Mongo you can change a class property name but still have its value map to the same field name (e.g. using the BsonElement attribute). Or you can tell the deserializer to ignore Mongo fields that don't have a corresponding class property, using the BsonIgnoreExtraElements attribute, so deleting a property won't cause an exception when the old field is loaded from Mongo.
Overall though, for any structural schema changes you'll probably need to reload the data or run a migration script. The other alternative is to use C# dynamic variables, although that doesn't really solve the underlying problem, you'll just get fewer serialization errors.
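For instance (property and field names here are invented), a renamed property can keep reading the old field name, and unknown fields can be ignored:

using MongoDB.Bson.Serialization.Attributes;

[BsonIgnoreExtraElements]   // old fields with no matching property no longer cause errors
public class Article
{
    // The property used to be called "Body"; existing documents still use that field name.
    [BsonElement("Body")]
    public string Content { get; set; }
}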

I've been using MongoDB for a little over a year now, though not for very large projects. I use hugo's csmongo or the fork here. I like the dynamic approach it introduces, which is especially useful for projects where the database structure is volatile.

Related

Serializing complex EF model over JSON

I have done a lot of searching and experimenting and have been unable to find a workable resolution to this problem.
Environment/Tools
Visual Studio 2013
C#
Three tier web application:
Database tier: SQL Server 2012
Middle tier: Entity Framework 6.* using Database First, Web API 2.*
Presentation tier: MVC 5 w/Razor, Bootstrap, jQuery, etc.
Background
I am building a web application for a client that requires a strict three-tier architecture. Specifically, the presentation layer must perform all data access through a web service. The presentation layer cannot access a database directly. The application allows a small group of paid staff members to manage people, waiting lists, and the resources they are waiting for. Based on the requirements the data model/database design is entirely centered around the people (User table).
Problem
When the presentation layer requests something, say a Resource, it is related to at least one User, which in turn is related to some other table, say Roles, which are related to many more Users, which are related to many more Roles and other things. The point being that, when I query for just about anything, EF wants to bring in almost the entire database.
Normally this would be okay because of EF's default lazy-load behavior, but when serializing just about any object to JSON to return it to the presentation layer, the Newtonsoft.Json serializer hangs for a long time and then fails with a stack overflow.
What I Have Tried
Here is what I have attempted so far:
Set Newtonsoft's JSON serializer ReferenceLoopHandling setting to Ignore. No luck. This is not a cyclic-graph issue; it is just the sheer volume of data that gets brought in (there are over 20,000 Users).
Clear/reset unneeded collections and set reference properties to null. This showed some promise, but I could not get around Entity Framework's desire to track everything.
Just setting nav properties to null/clearing them causes those changes to be saved back to the database on the next .SaveChanges() (NOTE: this is an assumption, but it seemed pretty sound. If anyone knows differently, please speak up).
Detaching the entities causes EF to automatically clear ALL collections and set ALL reference properties to null, whether I wanted it to or not.
Using .AsNoTracking() on everything threw some exception about not allowing non-tracked entities to have navigation properties (I don't recall the exact details).
Use AutoMapper to make copies of the object graph, only including related objects I specify. This approach is basically working, but in the process of (I believe) performing the auto-mapping, all of the navigation properties are accessed, causing EF to query and resolve them. In one case this leads to almost 300,000 database calls during a single request to the web service.
What I am Looking For
In short, has anyone had to tackle this problem before and come up with a working and performant solution?
Lacking that, any pointers for at least where to look for how to handle this would be greatly appreciated.
Additional Note: It occurred to me as I wrote this that I could possibly combine the second and third items above. In other words, set/clear nav properties, then automap the graph to new objects, then detach everything so it won't get saved (or perhaps wrap it in a transaction and roll it back at the end). However, if there is a more elegant solution I would rather use that.
Thanks,
Dave
It is true that doing what you are asking for is very difficult, and it's an architectural trap I see a lot of projects get stuck in.
Even if this problem were solvable, you'd basically end up with a data layer that just wraps the database and destroys performance, because you can't leverage SQL properly.
Instead, consider building your data access service so that it returns meaningful objects containing meaningful data; that is, only the data required to perform a specific task outlined in the requirements documentation. It is true that a post is related to an account, which has many achievements, etc, etc. But usually all I want is the text and the name of the poster, and I want it for each post on a page, not just one. So write data services and methods which do things that are relevant to your application.
To clarify, it's the difference between returning a Page object containing a list of Posts which hold only a poster name and a message, and returning entire EF objects full of irrelevant data such as IDs and auditing columns like creation time.
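As a sketch of that difference (type and property names are hypothetical), a projection keeps the query to exactly those fields and nothing else gets loaded or tracked:

using System.Collections.Generic;
using System.Linq;

public class PostSummaryDto
{
    public string PosterName { get; set; }
    public string Message { get; set; }
}

public class PageDto
{
    public int PageNumber { get; set; }
    public List<PostSummaryDto> Posts { get; set; }
}

public class PostService
{
    // db is your EF context; Posts, User, Text and CreatedAt are placeholder names.
    public PageDto GetPage(MyDbContext db, int pageNumber, int pageSize)
    {
        var posts = db.Posts
            .OrderByDescending(p => p.CreatedAt)
            .Skip((pageNumber - 1) * pageSize)
            .Take(pageSize)
            .Select(p => new PostSummaryDto
            {
                PosterName = p.User.Name,
                Message = p.Text
            })
            .ToList();   // runs as a single SQL query returning only these columns

        return new PageDto { PageNumber = pageNumber, Posts = posts };
    }
}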
Consider the Twitter API. If it were implemented as above, performance would be abysmal with the amount of traffic Twitter gets. And most of the information returned (costing CPU time, disk activity, DB connections as they're held open longer, network bandwidth) would be completely irrelevant to what developers want to do.
Instead, the API exposes what would be useful to a developer looking to make a Twitter app. Get me the posts by this user. Get me the bio for this user. This is probably implemented as very nicely tuned SQL queries for someone as big as Twitter, but for a smaller client, EF is great as long as you don't attempt to defeat its performance features.
This additionally makes testing much easier as the smaller, more relevant data objects are far easier to mock.
For three-tier applications, especially if you are going to expose your entities "raw" in services, I would recommend disabling Lazy Loading and Proxy generation in EF. The alternative is to use DTOs instead of entities, so that the web services return a model tailored to the service instead of the entity (as suggested by jameswilddev).
Either way will work, and has a variety of trade-offs.
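For the first option, a minimal EF6 sketch (the context name, connection name and entity are made up):

using System.Data.Entity;

public class Resource
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class AppDbContext : DbContext
{
    public AppDbContext() : base("name=AppConnection")
    {
        // Entities come back as plain POCOs: navigation properties stay null
        // unless you explicitly Include() or project them.
        Configuration.LazyLoadingEnabled = false;
        Configuration.ProxyCreationEnabled = false;
    }

    public DbSet<Resource> Resources { get; set; }
}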
If you are using EF in a multi-tier environment, I would highly recommend Julia Lerman's DbContext book (I have no affiliation): http://www.amazon.com/Programming-Entity-Framework-Julia-Lerman-ebook/dp/B007ECU7IC
There is a chapter in the book dedicated to working with DbContext in multi-tier environments (you will see the same recommendations about Lazy Load and Proxy). It also talks about how to manage inserts and updates in a multi-tier environment.
I had such a project, and it was a stressful one: I needed to load a large amount of data, process it from different angles, and pass it to a complex dashboard of charts and tables.
My optimizations were:
1 - Instead of using EF to load the data, I called old-school stored procedures (and, for further optimization, grouped the results to keep the tables as small as possible for the charts, e.g. one query returning a table from which multiple chart datasets can be extracted). There is a sketch of this after the list.
2 - More importantly, instead of Newtonsoft's JSON I used fastJSON, whose performance is worth mentioning (it is really fast, but not compatible with complex objects; a simple example would be view models that contain lists of models, which in turn contain more, and so on).
Better to read the pros and cons of fastJSON first:
https://www.codeproject.com/Articles/159450/fastJSON
3 - The relational database design is the prime suspect in this problem. It can be worth keeping the tables that hold raw data for processing (most probably for analytics) in a denormalized schema, which saves time when querying the data.
Also beware of using the model classes the EF designer generates from the database for reading or selecting data, especially when you want to serialize them (sometimes I think about splitting the same schema model into two sets of otherwise identical classes/models for writing and reading, so that the write models benefit from the virtual collections coming from foreign keys while the read models ignore them... I am not sure about this).
NOTE: in the case of very, very large data it is better to go deeper and set up in-memory OLTP for the tables that contain facts or raw data; in that case the table acts like a non-relational table, similar to NoSQL.
NOTE: for example, in MSSQL you can take advantage of SQLCLR, which lets you write routines in C#, VB, etc. and call them from T-SQL, in other words handle data processing at the database level.
4 - For interactive views that need to load data, consider which information should be processed on the server side and which can be handled on the client side (sometimes it is better to query data from the client side, though you should keep in mind that data on the client side can be accessed by the user); it depends on the situation.
5 - For large raw-data tables in a view, datatables.min.js is a good idea, and everyone suggests using server-side paging on tables.
6 - For importing and exporting data from big files, I think OLE DB is the best choice.
However, I still doubt these are exact solutions. If anybody has practical solutions, please mention them ;)
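A rough illustration of point 1 (the procedure name, parameters and DashboardRow shape are invented; EF6's Database.SqlQuery is assumed):

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;

public class DashboardRow
{
    public string ChartKey { get; set; }
    public string Label { get; set; }
    public decimal Value { get; set; }
}

public class DashboardLoader
{
    public List<DashboardRow> Load(MyDbContext db, DateTime from, DateTime to)
    {
        // One stored procedure returns a pre-grouped table that several charts read from.
        return db.Database
                 .SqlQuery<DashboardRow>(
                     "EXEC dbo.GetDashboardData @From, @To",
                     new SqlParameter("@From", from),
                     new SqlParameter("@To", to))
                 .ToList();
    }
}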
I have fiddled with a similar problem using EF model first, and found the following solution satisfying for "One to Many" relations:
Include "Foreign key properties" in the sub-entities and use this for later look-up.
Define the get/set modifiers of any "Navigation Properties" (sub-collections) in your EF entity as private.
This will give you an object that does not expose the sub-collections, so only the main properties get serialized. This workaround requires some restructuring of your LINQ queries: ask directly from your table of SubItems, with the foreign key property as your filtering option, like this:
var myFitnessClubs = context.FitnessClubs
    .Where(f => f.FitnessClubChainID == myFitnessClubChain.ID);
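For reference, a rough sketch of the entity shape this relies on (hypothetical names matching the query above; the exact mapping comes from the EF designer in a model-first setup):

using System.Collections.Generic;

public class FitnessClub
{
    public int ID { get; set; }
    public int FitnessClubChainID { get; set; }            // foreign key property used for look-up
    public virtual FitnessClubChain FitnessClubChain { get; set; }
}

public class FitnessClubChain
{
    public int ID { get; set; }
    public string Name { get; set; }

    // Private navigation collection: never exposed, so it is never serialized.
    private ICollection<FitnessClub> FitnessClubs { get; set; }
}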
Note 1:
You may of course choose to implement this solution only partly, affecting just the sub-collections that you definitely do not want serialized.
Note 2:
For "Many to Many" relations, at least one of the entities needs to have a public representation of the collection. Since the relation cannot be retrieved using a single ID property.

Does setting serialization mode to Unidirectional have any drawback or performance hit?

I want to add a history feature for one of my application's entities. My application is using LINQ to SQL as its O/RM.
The solution I have selected for saving previous versions of an entity is to serialize the old version and store the serialized versions somewhere.
Anyway, I was looking for the best way to serialize LINQ to SQL entities. One of the best solutions I found was simply setting the serialization mode to Unidirectional on the DataContext object and then using a .NET serializer. The method seems to work well.
But my question is: does this method have any side effects on the data context, or any performance hit? For example, does lazy loading of objects still work after turning unidirectional serialization mode on?
I've searched a lot but found nothing.
And, do you think I should go another way to serialize objects? Any cheaper method?
PS 1: I don't like using DTOs, because I would have to maintain compatibility between the DataContext objects and the DTOs.
PS 2: Please note that the application I'm working on is relatively large and currently under heavy load, so I have to be careful about changes in O/RM behavior or performance issues, because I cannot review the entire application + online versions.
Thanks a lot :)
Re "and then use .net serializer", the intent of unidirectional mode is to support DataContractSerializer (specifically). The effect is basically just the addition of the [DataContract]/[DataMember] flags, but: based on experience, and despite your reservations, I would strongly advise you to go the DTO route:
ORMs, due to lazy loading of both sub-collections and (sometimes) properties, make it very hard to predict exactly what is going to happen
and at the same time, make it hard to monitor when data load is happening (N+1 in your serialization code can go un-noticed)
you have very little control over the size of the model (how many levels of descendants etc) that is going to be serialized
and don't even get me started on parent navigation
the deserialized objects will not have a data-context, which means that their lazy loading won't work - this can lead to unexpected behaviours
you can get into all sorts of problems if you can't version your data model separately from your serialization model
there are some... "peculiarities" that impact LINQ-to-SQL and serialization (from memory, the most "fun" one is a difference in how IList.Add and IList<T>.Add are implemented, meaning that one triggers data-change flags and one doesn't; nasty)
you are making everything dependent on your data model, which also means that you can never change your data model (at least, not easily)
a lot of scenarios that involve DTO are read-based, not write-based, so you might not even need the ORM overheads in the first place - for example, you might be able to replace some hot code-path pieces with lighter tools (dapper, etc) - you can't do that well if everything is tied to the data model
I strongly advocate biting the bullet and creating a DTO model that matches what you want to serialize; the DTO model would be incredibly simple, and should only take a few minutes to throw together, including the code to map to/from your data model. This will give you total control over what is serialized, while also giving you the freedom to iterate without headaches and to do things like use different serializers.
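As a sketch of the shape of such a DTO and its mapping (entity and property names are made up; the real ones come from your dbml):

using System;
using System.Collections.Generic;
using System.Linq;

[Serializable]
public class OrderHistoryDto
{
    public int OrderId { get; set; }
    public DateTime PlacedAt { get; set; }
    public List<string> ItemNames { get; set; }
}

public static class OrderMapping
{
    // Order / OrderDetails stand in for the LINQ-to-SQL generated types; only the
    // fields you actually want to keep in the history end up in the DTO.
    public static OrderHistoryDto ToHistoryDto(Order order)
    {
        return new OrderHistoryDto
        {
            OrderId = order.OrderID,
            PlacedAt = order.OrderDate,
            ItemNames = order.OrderDetails.Select(d => d.ProductName).ToList()
        };
    }
}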

serializing LinqToSql generated entities keeping relations and lazy loading

We have a rather large ASP.NET MVC project using LINQ to SQL which we are in the process of migrating to Windows Azure.
Now we need to serialize objects for storage in the Azure distributed cache, and setting "Serialization Mode" to "Unidirectional" in the .dbml file, thus automatically decorating the generated classes and properties with DataContract and DataMember attributes, seems to be the recommended way. However, this causes any relationships not yet loaded by LINQ to SQL to be lost when serializing and saved as null.
What would be the preferred way of proceeding, taking into account a couple of things:
As mentioned, it's a rather large project with the generated *.designer.cs file being close to 1.5MB
Disabling lazy loading completely would most likely be a big performance hit due to many deep class relationships.
Changing ORM tool is something we are considering, but doing this at the same time as switching platforms would probably be a bad thing.
If this boils down to somehow manually specifying which objects and relations to serialize across the whole project, then using something like protobuf-net for some extra performance gains would probably not be a huge step.
However, this causes any relationships not yet loaded by LINQ to SQL to be lost when serializing and saved as null.
Yes, this is normal and expected when serializing - you are essentially taking a snapshot of what was available at that time, because lazy loading depends on the object being attached to a data context. It would be inadvisable for any tool to crawl the entire model looking for things to load, because that could keep going indefinitely, essentially bringing large chunks of unwanted data into play.
Options:
explicitly fetch (either pre-emptively via LoadWith, or by hitting the appropriate properties) the data you are interested in before serializing; there is a sketch of this after the list
or, load the data into a completely separate DTO model for serialization - in many ways, this is a re-statement of the first, since it will by necessity involve iterating over (projecting) the data you want, but it means you are creating the DTO to suit the exact shape you actually want to send
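A sketch of the first option using DataLoadOptions.LoadWith (Order/OrderDetails and the context name are placeholders):

using System.Collections.Generic;
using System.Data.Linq;
using System.Linq;

public static class CachePrefetch
{
    public static List<Order> LoadOrdersForCache(string connectionString, int customerId)
    {
        using (var db = new MyDataContext(connectionString))
        {
            var options = new DataLoadOptions();
            options.LoadWith<Order>(o => o.OrderDetails);   // eager-load this relationship
            db.LoadOptions = options;                        // must be set before the first query runs

            // The returned graph already contains OrderDetails, so nothing is lost
            // when the snapshot is serialized into the cache.
            return db.Orders.Where(o => o.CustomerID == customerId).ToList();
        }
    }
}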

C#: Save serialized object to the text field in the DB, how maintain object format versions in the future?

I have a complex object (nested properties, collections, etc.) in my ASP.NET MVC C# application. I don't need to save it into multiple tables in the DB; serializing the whole object and storing it as a whole is fine.
I plan to serialize the whole object (to something human-readable like JSON/XML) and store it in a text field in the DB.
I need to later load this object from the DB and render it using a strongly-typed view.
Here comes the question: in the future the class of the object can change (I can add/remove fields etc.), but serialized versions saved into the DB before that will not reflect the change.
How to deal with this?
You should write some sort of conversion utility every time you make a significant structural change to the serialized messages, and run it as part of an upgrade process. Adding or removing fields that are nullable isn't likely to be a problem, but larger structural changes will be.
You could do something like implement IXmlSerializable, peek at the message and figure out what version the message is and convert it appropriately, but this will quickly become a mess if you have to do this a lot and your application has a long lifecycle. So, you're better off doing it up front in an upgrade process, and outside of your application.
If you are worried about running the conversion on lots of records as part of an upgrade, you could come up with some ways to make it more efficient (for example, add a column to the table that contains the message schema version, so you can efficiently target messages that are out of date).
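A hypothetical sketch of that version-column idea (table, column and helper names are invented; Json.NET is assumed for re-serialization):

using System.Linq;
using Newtonsoft.Json;

public class SerializedDocumentUpgrader
{
    private const int CurrentSchemaVersion = 3;

    // Runs as part of the upgrade process, outside the application itself.
    public void Upgrade(MyDbContext db)
    {
        // Only rows whose stored schema version is behind the current one get touched.
        var stale = db.SerializedDocuments
                      .Where(d => d.SchemaVersion < CurrentSchemaVersion)
                      .ToList();

        foreach (var doc in stale)
        {
            var upgraded = ConvertToCurrent(doc.Payload, doc.SchemaVersion);
            doc.Payload = JsonConvert.SerializeObject(upgraded);
            doc.SchemaVersion = CurrentSchemaVersion;
        }

        db.SaveChanges();
    }

    private object ConvertToCurrent(string payload, int fromVersion)
    {
        // Version-specific conversion logic lives here.
        throw new System.NotImplementedException();
    }
}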
As long as you're using JSON or XML, added fields shouldn't be a problem (as long as no specific version schemas are enforced). The default .NET XML serializer, for instance, doesn't include members that still have their default value (which can be set with the System.ComponentModel.DefaultValue attribute), so new fields will be treated the same as omitted fields during deserialization and get their default values (the default class values, that is; the DefaultValue attribute only applies to serialization/designer behaviour).
How removed fields behave depends on your deserialization implementation, but it can usually be set up so that they are ignored. Personally I tend to keep the properties but mark them as obsolete, with a message explaining what they were once for. That way you know not to use them when coding, but they can still be filled for backwards compatibility (and they shouldn't be serialized once marked obsolete). When possible you can implement logic in the obsolete property that fills the renewed data structure.
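A sketch of that obsolete-property approach, using Json.NET as the example serializer (property names are invented; the ShouldSerialize* convention keeps the old field out of new payloads):

using System;
using System.Collections.Generic;
using Newtonsoft.Json;

public class Invoice
{
    public List<string> RecipientEmails { get; set; } = new List<string>();

    [Obsolete("Replaced by RecipientEmails; kept only so old payloads still deserialize.")]
    [JsonProperty("recipientEmail")]               // matches the field name in old documents
    public string RecipientEmail
    {
        get { return null; }
        set
        {
            // Forward the legacy value into the renewed data structure.
            if (!string.IsNullOrEmpty(value))
                RecipientEmails.Add(value);
        }
    }

    // Json.NET convention: never write the obsolete field back out.
    public bool ShouldSerializeRecipientEmail() { return false; }
}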

Is there a way to make a query that looks in serialized binary object in SQL Server 2008?

I have an object called Data that is serialized and stored as varbinary(MAX). The Data object contains a property named Source. Is there a way to do something like this:
select * from content_table where Data.Source == 'Feed'
I know that this is possible when XML serialization is used (XQuery), but the serialization type cannot be changed in this case.
If you have used BinaryFormatter, then no, not really - at least, not without deserializing the entire object model, which is usually not possible at the database. It is an undocumented format with very little provision for ad-hoc query.
Note: BinaryFormatter is not (IMO) a good choice for anything that relates to storage of items; I fully expect that this will bite you at some point (i.e. unable to reliably deserialize the data you have stored). Pain points:
tightly tied to type names; can break as you move code around
tightly tied to field names; can break as you refactor your classes (even just converting a field to an automatically implemented property is a breaking change)
prone to including a bigger graph than you expect, in particular via events
It is of course also platform-specific, and potentially framework-specific.
In all seriousness, I've lost count of the number of "I can't deserialize my data" questions I've fielded over the years...
There are alternative binary serializers that do allow some (limited) ability to inspect the data via a reader (without requiring full deserialization), and which do not become tied to the type metadata (instead being contract-based, allowing deserialization into any suitable type model, not just that specific type/version).
However, I genuinely doubt that such an approach would work with anything approaching efficiency in a WHERE clause etc.; you would need a SQL/CLR method etc. IMO, a better approach here is simply to store the required filter columns as data in other columns, allowing you to add indexing etc. On the rare occasions that I have used the xml type, this is what I have done there (with the small caveat that with xml you can use "lifted" computed+stored+indexed columns from the underlying xml data, which wouldn't be possible here - the extra columns would have to be explicit).
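A sketch of that suggestion (content_table and the Data column come from the question; the Source column, serializedBytes and the class name are assumptions): store the filter field in its own column next to the blob, so the WHERE clause never has to look inside it.

using System.Data;
using System.Data.SqlClient;

public static class ContentStore
{
    public static void Save(string connectionString, string source, byte[] serializedBytes)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "INSERT INTO content_table (Source, Data) VALUES (@source, @data)", conn))
        {
            cmd.Parameters.AddWithValue("@source", source);
            cmd.Parameters.Add("@data", SqlDbType.VarBinary, -1).Value = serializedBytes;
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}

// Querying is then plain, indexable SQL:
//   SELECT * FROM content_table WHERE Source = 'Feed'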
You could deserialise the data using a SQL CLR function, but I suspect it won't be fast. It all depends on how the serialisation was done. If the library is available, then a simple CLR function should be able to query the data quite easily.
